Examples of classification problems:

- Credit scoring: will the applicant default or not?
- SPAM filter: is an e-mail message SPAM or not?
- Medical diagnosis: does the patient have breast cancer?
- Handwritten digit recognition.
- Music genre classification.

At a particular point x the value of y is not uniquely determined. It can assume both its values with respective probabilities that depend on the location of the point x in the input space. We write

p(y = 1|x) = 1 − p(y = 0|x).

The goal of a classification procedure is to produce an estimate of p(y|x) at every input point x.
Examples of discriminative classification methods:

- Linear probability model (this lecture)
- Logistic regression (this lecture)
- Classification trees (Book: section 4.3)
- Feed-forward neural networks (Book: section 5.4)
- ...

An alternative paradigm for estimating p(y|x) is based on density estimation. Here Bayes’ theorem

p(y = 1|x) = p(y = 1)p(x|y = 1) / [p(y = 1)p(x|y = 1) + p(y = 0)p(x|y = 0)]

is applied, where p(x|y) are the class conditional probability density functions and p(y) are the unconditional (prior) probabilities of each class.
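A small numerical sketch of the Bayes' theorem computation above. The priors and the Gaussian class-conditional densities are made up for illustration; only the formula itself comes from the lecture.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical priors and class-conditional densities (not from the lecture).
prior1, prior0 = 0.3, 0.7
p_x_given_1 = lambda x: normal_pdf(x, mu=2.0, sigma=1.0)
p_x_given_0 = lambda x: normal_pdf(x, mu=0.0, sigma=1.0)

def posterior1(x):
    """Bayes' theorem: p(y = 1|x)."""
    num = prior1 * p_x_given_1(x)
    return num / (num + prior0 * p_x_given_0(x))

# At x = 1 the two class densities are equal, so the posterior
# collapses to the prior 0.3; away from the midpoint the data
# pull the posterior toward one class or the other.
print(posterior1(1.0))
print(posterior1(3.0), posterior1(-1.0))
```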
Examples of density estimation based classification methods:

- Linear/Quadratic Discriminant Analysis (not discussed),
- Naive Bayes classifier (Book: section 5.3),
- ...

Consider the linear regression model

y = wᵀx + ε,   y ∈ {0, 1}

Note:

wᵀ = [w0 w1 ... wd],   x = [1 x1 ... xd]ᵀ,

so wᵀx = w0 + Σᵢ₌₁..d wᵢxᵢ.

By assumption E[ε|x] = 0, so we have

E[y|x] = wᵀx.

But

E[y|x] = 1 · p(y = 1|x) + 0 · p(y = 0|x) = p(y = 1|x).
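A minimal sketch of the linear probability model on made-up toy data: fitting w by ordinary least squares on a 0/1 response makes the fitted values wᵀx estimates of p(y = 1|x). The example also exposes the model's well-known weakness: fitted "probabilities" can fall outside [0, 1].

```python
# Toy data (made up): one feature x, binary response y in {0, 1}.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
n = len(x)

# Ordinary least squares for y = w0 + w1*x (closed form, single feature).
x_bar = sum(x) / n
y_bar = sum(y) / n
w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
w0 = y_bar - w1 * x_bar

# Fitted values approximate E[y|x] = p(y = 1|x), but are not guaranteed
# to stay inside [0, 1] -- a known drawback of the linear probability model.
p_hat = [w0 + w1 * xi for xi in x]
print(w0, w1)
print(p_hat)  # note the first fitted value is negative
```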
Write z = wᵀx:

ln[ p(y = 1|x) / (1 − p(y = 1|x)) ]
  = ln[ (1 + e^(−z))^(−1) / (1 − (1 + e^(−z))^(−1)) ]
  = ln[ 1 / ((1 + e^(−z)) − 1) ]
  = ln(1 / e^(−z))
  = ln(e^z) = z = wᵀx

In the second step, we divided the numerator and the denominator by (1 + e^(−z))^(−1).

The ratio p(y = 1|x)/(1 − p(y = 1|x)) is called the odds.

[Figure: the logistic response function E[y|x] plotted against wᵀx, an S-shaped curve between 0.0 and 1.0.]
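The identity above is easy to check numerically: taking the log-odds of the logistic response recovers z exactly. A small sketch (the sample z values are arbitrary):

```python
import math

def sigmoid(z):
    """Logistic response: (1 + e^(-z))^(-1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The log-odds of the logistic response should recover z itself.
for z in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    p = sigmoid(z)
    log_odds = math.log(p / (1.0 - p))
    print(z, log_odds)  # the two columns agree up to rounding
```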
Assign to class 1 if p(y = 1|x) > p(y = 0|x), i.e. if

p(y = 1|x) / (1 − p(y = 1|x)) > 1.

[Figure: decision boundary in the (x1, x2) plane, given by wᵀx = w0 + w1x1 + w2x2 = 0.]
Setting the derivative of the log-likelihood to zero:

d ln p(y|µ)/dµ = 7/µ − 3/(1 − µ) = 0,

which yields maximum likelihood estimate µ_ML = 0.7. This is just the relative frequency of heads in the sample.

Note:

d ln x/dx = 1/x.

[Figure: the log-likelihood plotted against µ on [0, 1], with its maximum at µ = 0.7.]
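A numeric check of the coin example, assuming (as the derivative 7/µ − 3/(1 − µ) implies) a sample of 7 heads and 3 tails: a grid search over the Bernoulli log-likelihood locates its maximum at the relative frequency 0.7.

```python
import math

heads, tails = 7, 3  # inferred from the derivative 7/mu - 3/(1 - mu)

def log_likelihood(mu):
    """Bernoulli log-likelihood: heads*ln(mu) + tails*ln(1 - mu)."""
    return heads * math.log(mu) + tails * math.log(1.0 - mu)

# Evaluate on a fine grid over (0, 1) and pick the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
mu_ml = max(grid, key=log_likelihood)
print(mu_ml)  # 0.7, the relative frequency of heads
```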
Since the yi observations are independent:

p(y|w) = ∏ᵢ₌₁..N p(yᵢ) = ∏ᵢ₌₁..N µᵢ^yᵢ (1 − µᵢ)^(1−yᵢ)

Since for the logistic regression model:

µᵢ = (1 + e^(−wᵀxᵢ))^(−1)
1 − µᵢ = (1 + e^(wᵀxᵢ))^(−1)
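The likelihood above can be evaluated directly for any candidate w. A sketch with a hypothetical weight vector and data (in practice one maximizes the logarithm of this product over w rather than evaluating it once):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(w, xs, ys):
    """p(y|w) = prod_i mu_i^y_i * (1 - mu_i)^(1 - y_i),
    with mu_i = sigmoid(w . x_i)."""
    total = 1.0
    for x, y in zip(xs, ys):
        mu = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total *= mu ** y * (1.0 - mu) ** (1 - y)
    return total

# Hypothetical data: each x_i = [1, x_i1] includes the intercept term.
xs = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
ys = [0, 1, 0]
print(likelihood([-1.0, 1.0], xs, ys))
```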
Substitute the maximum likelihood estimates into the response function to obtain the fitted response function

p̂(y = 1|x) = e^(w_MLᵀx) / (1 + e^(w_MLᵀx))

Example: model the probability of successfully completing a programming assignment. Explanatory variable: “programming experience” (in months).

We find w0 = −3.0597 and w1 = 0.1615, so

p̂(y = 1|xᵢ) = e^(−3.0597 + 0.1615xᵢ) / (1 + e^(−3.0597 + 0.1615xᵢ))

For 14 months of programming experience:

p̂(y = 1|x = 14) = e^(−3.0597 + 0.1615(14)) / (1 + e^(−3.0597 + 0.1615(14))) ≈ 0.31
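Plugging the estimates into the fitted response function reproduces the 0.31 figure:

```python
import math

w0, w1 = -3.0597, 0.1615  # maximum likelihood estimates from the lecture

def p_hat(x):
    """Fitted response: e^(w0 + w1*x) / (1 + e^(w0 + w1*x))."""
    z = w0 + w1 * x
    return math.exp(z) / (1.0 + math.exp(z))

print(round(p_hat(14), 2))  # 0.31
```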
Observed data and fitted probabilities:

       month.exp  success  fitted          month.exp  success  fitted
   1      14         0     0.310262    16     13         0     0.276802
   2      29         0     0.835263    17      9         0     0.167100
   3       6         0     0.109996    18     32         1     0.891664
   4      25         1     0.726602    19     24         0     0.693379
   5      18         1     0.461837    20     13         1     0.276802
   6       4         0     0.082130    21     19         0     0.502134
   7      18         0     0.461837    22      4         0     0.082130
   8      12         0     0.245666    23     28         1     0.811825
   9      22         1     0.620812    24     22         1     0.620812
  10       6         0     0.109996    25      8         1     0.145815
  11      30         1     0.856299
  12      11         0     0.216980
  13      30         1     0.856299
  14       5         0     0.095154
  15      20         1     0.542404

Probability of the classes is equal when

−3.0597 + 0.1615x = 0.

Solving for x we get x ≈ 18.95.

Allocation rule:
x ≥ 19: assign to class 1
x < 19: assign to class 0
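Applying the allocation rule (assign to class 1 when x ≥ 19) to the 25 observations above reproduces the error rate of 6/25 and the cross table of observed versus predicted labels:

```python
# (month.exp, success) pairs from the data table above.
data = [(14,0),(29,0),(6,0),(25,1),(18,1),(4,0),(18,0),(12,0),(22,1),(6,0),
        (30,1),(11,0),(30,1),(5,0),(20,1),(13,0),(9,0),(32,1),(24,0),(13,1),
        (19,0),(4,0),(28,1),(22,1),(8,1)]

# Allocation rule: assign to class 1 iff month.exp >= 19.
confusion = {(obs, pred): 0 for obs in (0, 1) for pred in (0, 1)}
for x, obs in data:
    pred = 1 if x >= 19 else 0
    confusion[(obs, pred)] += 1

errors = confusion[(0, 1)] + confusion[(1, 0)]
print(confusion)           # (observed, predicted) -> count
print(errors / len(data))  # 0.24
```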
Cross table of observed and predicted class labels:

             predicted
              0    1
 observed 0  11    3
          1   3    8

Error rate: 6/25 = 0.24
Default (always predicting the majority class 0): 11/25 = 0.44

Conn’s syndrome. Two possible causes:
a) Benign tumor (adenoma) of the adrenal cortex.
b) More diffuse affection of the adrenal glands (bilateral hyperplasia).

Pre-operative diagnosis on the basis of
1. Sodium concentration (mmol/l)
2. CO2 concentration (mmol/l)
Conn’s syndrome data (cause: a = 1, b = 0):

      sodium   co2  cause        sodium   co2  cause
  1   140.6   30.3    0     16   139.0   31.4    0
  2   143.0   27.1    0     17   144.8   33.5    0
  3   140.0   27.0    0     18   145.7   27.4    0
  4   146.0   33.0    0     19   144.0   33.0    0
  5   138.7   24.1    0     20   143.5   27.5    0
  6   143.7   28.0    0     21   140.3   23.4    1
  7   137.3   29.6    0     22   141.2   25.8    1
  8   141.0   30.0    0     23   142.0   22.0    1
  9   143.8   32.2    0     24   143.5   27.8    1
 10   144.6   29.5    0     25   139.7   28.0    1
 11   139.5   26.0    0     26   141.1   25.0    1
 12   144.0   33.7    0     27   141.0   26.0    1
 13   145.0   33.0    0     28   140.5   27.0    1
 14   140.2   29.1    0     29   140.0   26.0    1
 15   144.7   27.4    0     30   140.0   25.6    1

[Figure: scatterplot of co2 versus sodium, points labelled a (cause 1) and b (cause 0).]
Fitted logistic regression coefficients for the Conn’s syndrome data:

w0 = 36.6874320
w1 = −0.1164658 (sodium)
w2 = −0.7626711 (co2)

Assign to group a if

36.69 − 0.12 × sodium − 0.76 × CO2 > 0.

[Figure: scatterplot of co2 versus sodium with the fitted decision boundary w0 + w1·sodium + w2·co2 = 0.]
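As a check, applying the fitted rule (with the full-precision coefficients) to the 30 cases in the data table reproduces an error rate of 5/30:

```python
# (sodium, co2, cause) triples from the data table; cause: a = 1, b = 0.
data = [
    (140.6,30.3,0),(143.0,27.1,0),(140.0,27.0,0),(146.0,33.0,0),(138.7,24.1,0),
    (143.7,28.0,0),(137.3,29.6,0),(141.0,30.0,0),(143.8,32.2,0),(144.6,29.5,0),
    (139.5,26.0,0),(144.0,33.7,0),(145.0,33.0,0),(140.2,29.1,0),(144.7,27.4,0),
    (139.0,31.4,0),(144.8,33.5,0),(145.7,27.4,0),(144.0,33.0,0),(143.5,27.5,0),
    (140.3,23.4,1),(141.2,25.8,1),(142.0,22.0,1),(143.5,27.8,1),(139.7,28.0,1),
    (141.1,25.0,1),(141.0,26.0,1),(140.5,27.0,1),(140.0,26.0,1),(140.0,25.6,1),
]

w0, w1, w2 = 36.6874320, -0.1164658, -0.7626711

errors = 0
for sodium, co2, cause in data:
    # Assign to group a (cause = 1) when the linear score is positive.
    pred = 1 if w0 + w1 * sodium + w2 * co2 > 0 else 0
    errors += (pred != cause)

print(errors, len(data))  # 5 errors out of 30, error rate 1/6
```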
Conn’s Syndrome: Confusion Matrix

             predicted
              a    b
 observed a   7    3
          b   2   18

Error rate: 5/30 = 1/6
Default: 1/3

Conn’s Syndrome: Line with lower empirical error

[Figure: scatterplot of co2 versus sodium (class labels a and b) with a separating line that makes fewer empirical errors than the logistic regression boundary.]