Statistical Learning Techniques

Statistical Learning
Part of this material was provided by Bea López, to whom I am grateful.
Academic year 2012-2013
1- Instance-Based Learning (IBL)

2- Case-Based Reasoning

3- Neural Networks

4- Support Vector Machines
TOPIC CONTENTS
1.1 Introduction

1.2 Nearest Neighbour

1.3 K-Nearest Neighbour

1.4 Reduction techniques in IBL

Model-based learning. Example models: transfer function,
regression, ARMAX, residuals, ...
Instance-Based Learning (IBL) is a paradigm of learning in which
algorithms typically store some or all of the n available training
examples (instances) from a training set, T, during learning. Each
instance has an input vector x, and an output class c. During
generalization, these systems use a distance function to
determine how close a new input vector y is to each stored
instance, and use the nearest instance or instances to predict the
output class of y (i.e., to classify y).
IBL is nonparametric as it constructs hypotheses directly from the
training data. Training is typically very simple: Just store the
training instances.
1.1- Introduction
Example: finding a function to represent
the data.
IBL: keep the data as is. Each case is an instance
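Example: Response to a Saturated Ramp
[Figure: a saturated ramp (t_rin = 100 s, V_SAT = 1 V) is applied to the Unit Under Test. The response (amplitude vs. time) is described by SP, t_d, t_r and V_est; the levels 0.1 V_est, 0.5 V_est and 0.9 V_est are marked on the curve.]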
Instance Base
Fault     SP     t_r     t_d     V_est
FAULT 1   SP1    t_r1    t_d1    V_est1
FAULT 2   SP2    t_r2    t_d2    V_est2
...       ...    ...     ...     ...
FAULT m   SPm    t_rm    t_dm    V_estm
[Figure: measured response of the new case (amplitude vs. time), with the levels 0.1 V_est, 0.5 V_est and 0.9 V_est marked.]
New case: SP = 5.73 %, t_d = 13 s, t_r = 72 s, V_est = -0.98
V_est (V)   t_r (s)   t_d (s)   SP (%)    Fault
-1.0000     76        15        4.4029    Nominal
-1.0000     80        15        1.0031    C2-50
-1.0005     72        17        7.6834    C2+50
-1.0000     77        15        3.0781    C2-20
-0.9997     73        16        5.7085    C2+20
-0.4999     75        5         5.2359    R6-50
-1.5004     80        24        3.3711    R6+50
-0.8000     75        11        4.6614    R6-20
-1.2001     77        19        4.0473    R6+20
-0.9996     86        31        2.1315    R5-50
-1.0002     76        9         4.8682    R5+50
-1.0002     77        20        3.9447    R5-20
-0.9999     75        12        4.6189    R5+20
-1.0000     92        29        0         R4-50
-0.9990     71        10        6.9145    R4+50
-1.0000     80        19        2.4917    R4-20
-0.9994     73        12        5.7311    R4+20
-0.9999     75        5         5.2359    R2,R3,C1-50
-1.0003     80        24        3.3711    R2,R3,C1+50
-1.0000     75        11        4.6614    R2,R3,C1-20
-1.0001     77        19        4.0473    R2,R3,C1+20
-1.9996     76        15        4.4029    R1-50
-0.6665     76        15        4.4029    R1+50
-1.2498     76        15        4.4029    R1-20
-0.8332     76        15        4.4029    R1+20
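[Figure: the distances d_1, d_2, ..., d_m from the new situation to each instance are computed.]
The instance with the minimum distance to the new situation is proposed as the diagnosis.
Select a metric, e.g. the normalized Euclidean distance:
$$E(x, y) = \sqrt{\sum_{i=1}^{m} \left( \frac{x_i - y_i}{\mathrm{range}_i} \right)^2}$$
where i indexes the attributes and range_i is the range of attribute i.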
Instance base
Each instance i is described by its attribute vector (Ai1, Ai2, ..., Ain).
          Attribute 1   Attribute 2   ....   Attribute n
FAULT 1   A11           A12           ...    A1n
FAULT 2   A21           A22           ...    A2n
...
FAULT m   Am1           Am2           ...    Amn
1.2- Nearest Neighbour
In the 1-Nearest Neighbour (NN) algorithm, the class of a new case is predicted by its closest training sample.
[Figure: Case1 to Case4 and the New Case plotted in the plane Attribute 1 vs. Attribute 2.]
Case       Class
Case1      A
Case2      A
Case3      B
Case4      B
New case   B

Requirements:

-A set of labeled examples (training data)

- A metric to measure closeness
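The two requirements can be put together in a few lines. The sketch below is only an illustration: it stores five rows of the fault table, normalizes each attribute by the range observed in those five rows (an assumption, so the distances differ from the ones on the slides), and returns the label of the closest instance. The function names are hypothetical.

```python
import math

# (V_est [V], t_r [s], t_d [s], SP [%]) and fault label; rows copied from the fault table.
INSTANCE_BASE = [
    ((-1.0000, 76, 15, 4.4029), "Nominal"),
    ((-0.9997, 73, 16, 5.7085), "C2+20"),
    ((-0.9994, 73, 12, 5.7311), "R4+20"),
    ((-0.9996, 86, 31, 2.1315), "R5-50"),
    ((-0.6665, 76, 15, 4.4029), "R1+50"),
]


def attribute_ranges(instances):
    """Range (max - min) of every attribute over the stored instances."""
    columns = list(zip(*(x for x, _ in instances)))
    # A constant attribute gets range 1.0 so that the division below is safe.
    return [(max(col) - min(col)) or 1.0 for col in columns]


def normalized_euclidean(x, y, ranges):
    """Euclidean distance after dividing each difference by the attribute range."""
    return math.sqrt(sum(((xi - yi) / r) ** 2 for xi, yi, r in zip(x, y, ranges)))


def nearest_neighbour(query, instances):
    """Return (label, distance) of the stored instance closest to the query."""
    ranges = attribute_ranges(instances)
    x, label = min(instances, key=lambda inst: normalized_euclidean(inst[0], query, ranges))
    return label, normalized_euclidean(x, query, ranges)


new_case = (-0.98, 72, 13, 5.73)
print(nearest_neighbour(new_case, INSTANCE_BASE))
```

With this subset the new case is assigned to R4+20, the same fault that has the smallest distance in the slide example.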
[Figure: distances from the new case to some of the stored instances]
R4+20             0.043
C2+20             0.096
R2,R3,C1+20%      0.165
Nominal           0.204
R1+20             0.221
R1+50             0.272
R5-50             0.803
Success %   V_est (V)   t_r (s)   t_d (s)   SP (%)    Fault
69          -1.0000     76        15        4.4029    Nominal
100         -1.0000     80        15        1.0031    C2-50
99          -1.0005     72        17        7.6834    C2+50
89          -1.0000     77        15        3.0781    C2-20
82          -0.9997     73        16        5.7085    C2+20
100         -0.4999     75        5         5.2359    R6-50
98          -1.5004     80        24        3.3711    R6+50
86          -0.8000     75        11        4.6614    R6-20
79          -1.2001     77        19        4.0473    R6+20
91          -0.9996     86        31        2.1315    R5-50
82          -1.0002     76        9         4.8682    R5+50
38          -1.0002     77        20        3.9447    R5-20
47          -0.9999     75        12        4.6189    R5+20
100         -1.0000     92        29        0         R4-50
98          -0.9990     71        10        6.9145    R4+50
88          -1.0000     80        19        2.4917    R4-20
85          -0.9994     73        12        5.7311    R4+20
94          -0.9999     75        5         5.2359    R2,R3,C1-50
79          -1.0003     80        24        3.3711    R2,R3,C1+50
36          -1.0000     75        11        4.6614    R2,R3,C1-20
41          -1.0001     77        19        4.0473    R2,R3,C1+20
100         -1.9996     76        15        4.4029    R1-50
99          -0.6665     76        15        4.4029    R1+50
87          -1.2498     76        15        4.4029    R1-20
84          -0.8332     76        15        4.4029    R1+20
Average success: 82.04 %
Example with NN
[Figure: response of the new case (amplitude vs. time), with SP = 5.73 %, t_d = 13 s, t_r = 72 s and V_est = -0.98.]
Many distance functions have been proposed to
decide which instance is closest to a given input vector.
Many of these metrics work well for numerical attributes but do
not appropriately handle nominal (i.e., discrete, and perhaps
unordered) attributes. Many real-world applications have both
nominal and linear attributes.
In general:
For continuous feature vectors, just use Euclidean distance

For discrete features, just assume distance between two
values is 0 if they are the same, 1 if different (e.g. Hamming
distance).

To compensate for differences in units, scale all continuous
values to normalize their values to be between 0 and 1.
Examples of distance functions for continuous attributes
The Euclidean and Manhattan distance functions are equivalent to
the Minkowskian distance function with r = 2 and 1, respectively.
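For reference, the three functions named above can be written as follows for two input vectors x and y. The number of attributes is denoted m here, which is an assumption about notation, since the slide formulas did not survive extraction.

```latex
\begin{align}
D_{\text{Minkowski}}(x, y) &= \Big( \sum_{i=1}^{m} |x_i - y_i|^{r} \Big)^{1/r} \\
D_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{m} (x_i - y_i)^{2}} \qquad (r = 2) \\
D_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{m} |x_i - y_i| \qquad (r = 1)
\end{align}
```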
Examples of distance functions for nominal attributes
Hamming
D(x, y) = 0 if x = y
D(x, y) = 1 if x ≠ y
Value Difference Metric (VDM)
$$vdm_a(x, y) = \sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|^q = \sum_{c=1}^{C} \left| P_{a,x,c} - P_{a,y,c} \right|^q$$
where
N_{a,x} is the number of instances in the training set T that have value x for attribute a;
N_{a,x,c} is the number of instances in T that have value x for attribute a and output class c;
C is the number of output classes in the problem domain;
q is a constant, usually 1 or 2; and
P_{a,x,c} is the conditional probability that the output class is c given that attribute a has the value x, i.e., P(c | x_a).

P_{a,x,c} is defined as
$$P_{a,x,c} = \frac{N_{a,x,c}}{N_{a,x}}$$
where N_{a,x} is the sum of N_{a,x,c} over all classes, i.e.,
$$N_{a,x} = \sum_{c=1}^{C} N_{a,x,c}$$
and the sum of P_{a,x,c} over all C classes is 1 for a fixed value of a and x.
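A small sketch may help to see how the counts N_{a,x} and N_{a,x,c} turn into a distance between two nominal values. It assumes the usual form of the metric, a sum over classes of |P_{a,x,c} - P_{a,y,c}|^q, and uses a made-up five-instance attribute; the function names are hypothetical.

```python
from collections import Counter


def vdm_table(values, classes):
    """Count N_{a,x} and N_{a,x,c} for one nominal attribute a."""
    n_ax = Counter(values)
    n_axc = Counter(zip(values, classes))
    return n_ax, n_axc


def vdm(x, y, n_ax, n_axc, class_labels, q=2):
    """Sum over classes c of |P_{a,x,c} - P_{a,y,c}|^q.

    Both x and y must occur in the training set, otherwise N_{a,x} is zero.
    """
    total = 0.0
    for c in class_labels:
        p_x = n_axc[(x, c)] / n_ax[x]
        p_y = n_axc[(y, c)] / n_ax[y]
        total += abs(p_x - p_y) ** q
    return total


# Toy attribute "colour" with two output classes (illustrative data only).
colours = ["red", "red", "red", "blue", "blue"]
labels = ["A", "A", "B", "B", "B"]
n_ax, n_axc = vdm_table(colours, labels)
print(vdm("red", "blue", n_ax, n_axc, class_labels=("A", "B"), q=1))
```

Two values are close under this metric when they lead to similar class distributions, regardless of any ordering between them.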
Examples of distance functions for nominal and continuous attributes
One way to handle applications with both continuous and nominal
attributes is to use a heterogeneous distance function that uses
different attribute distance functions on different kinds of attributes.
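Heterogeneous Euclidean-Overlap Metric (HEOM)
This function defines the distance between two values x and y of a given attribute a as:
$$d_a(x, y) = \begin{cases} 1 & \text{if } x \text{ or } y \text{ is unknown} \\ \mathrm{overlap}(x, y) & \text{if } a \text{ is nominal} \\ \mathrm{rn\_diff}_a(x, y) & \text{otherwise} \end{cases}$$
with overlap(x, y) = 0 if x = y and 1 otherwise, and rn_diff_a(x, y) = |x - y| / range_a.
The overall distance between two (possibly heterogeneous) input vectors x and y is given by the Heterogeneous Euclidean-Overlap Metric function HEOM(x,y):
$$HEOM(x, y) = \sqrt{\sum_{a=1}^{m} d_a(x_a, y_a)^2}$$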
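As a rough illustration, HEOM can be sketched as below, assuming the usual definition: distance 1 for unknown values, the overlap metric for nominal attributes, and the range-normalized difference for continuous ones. Marking nominal attributes with `None` in `ranges` and unknown values with `None` is a convention chosen for this sketch only.

```python
import math


def heom(x, y, ranges):
    """HEOM distance between two instances.

    ranges[a] is the numeric range (max - min) of attribute a, or None when
    attribute a is nominal. None as an attribute value means "unknown".
    """
    total = 0.0
    for xa, ya, range_a in zip(x, y, ranges):
        if xa is None or ya is None:
            d = 1.0                        # unknown value: maximal distance
        elif range_a is None:
            d = 0.0 if xa == ya else 1.0   # nominal attribute: overlap metric
        else:
            d = abs(xa - ya) / range_a     # continuous: range-normalized difference
        total += d * d
    return math.sqrt(total)


print(heom((1.0, "red"), (3.0, "blue"), ranges=(4.0, None)))
```

Every per-attribute distance lies between 0 and 1 for values inside the observed range, so no single attribute dominates because of its units.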
For 1-nearest neighbour, the Voronoi diagram gives the complex
polyhedra that segment the space into the regions of points closest
to each training example (shown here in two dimensions).
Nearest Neighbour main drawbacks:
1. They are computationally expensive classifiers, since they store all training instances.
2. They are intolerant of attribute noise.
3. They are intolerant of irrelevant attributes.
4. They are sensitive to the choice of the algorithm's similarity function.
5. There is no natural way to work with nominal-valued attributes or missing attributes.
6. They provide little usable information regarding the structure of the data.
[Figure: noise problem. The same new case in the plane Attribute 1 vs. Attribute 2, diagnosed with 1-NN and with k-NN.]
1.3 k-Nearest Neighbor (kNN)
kNN is based on the principle that the instances within a dataset
generally exist in close proximity to other instances that have
similar properties. It is a supervised learning algorithm in which a new
query instance is classified by the majority class of its k nearest
neighbors.
The k-NN only requires:

- A set of labeled examples (training data)

- A metric to measure closeness

- An integer k
kNN classifier
The use of large values of k has two main advantages:
- It yields smoother decision regions and reduces the effect of noise.
- It provides probabilistic information.

However, values of k that are too large are detrimental:
- They destroy the locality of the estimation, since farther examples are taken into account.
- They increase the computational burden.

A good k can be selected by parameter optimization using, for
example, cross-validation.

For most low-dimensional data, k is usually between 5 and 10.
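The trade-off can be tried out on a toy problem. The sketch below is not the method used for the biquad results; it is a minimal kNN with plain Euclidean distance and a leave-one-out estimate used to pick k. The data set and all function names are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(query, training, k):
    """Majority vote among the k training instances closest to `query`."""
    neighbours = sorted(training, key=lambda inst: math.dist(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(training, k):
    """Fraction of training instances classified correctly by the others."""
    hits = 0
    for i, (x, label) in enumerate(training):
        rest = training[:i] + training[i + 1:]
        hits += knn_classify(x, rest, k) == label
    return hits / len(training)


def select_k(training, candidates):
    """Candidate k with the best leave-one-out accuracy."""
    return max(candidates, key=lambda k: leave_one_out_accuracy(training, k))


training = [((0.0,), "A"), ((1.0,), "A"), ((2.0,), "A"), ((3.0,), "A"),
            ((1.5,), "B"),   # a noisy instance inside the region of class A
            ((10.0,), "B"), ((11.0,), "B"), ((12.0,), "B")]
print(knn_classify((1.4,), training, k=1))
print(knn_classify((1.4,), training, k=3))
print(select_k(training, candidates=(1, 3)))
```

On this data k = 1 follows the noisy instance and answers B for the query, while k = 3 outvotes it and answers A; the leave-one-out estimate prefers k = 3 for the same reason.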

1NN vs. kNN

Numerical example on the biquad circuit. K=3 Neighbors

Case     SP      t_d      t_r      V_est     Class
Case 1   3.97    18 s     77 s     -0.8716   R1+20
(Norm)   0.433   0.5143   0.9506   -0.4636

Neighbors of Case 1:
Number   37      11      19
Class    R1+20   R1+20   R1+20
Class assigned: R1+20

Instance     SP     t_d    t_r    V_est     Class
Instance 5   4.33   17 s   76 s   -0.8811   R1+20

Neighbors of Instance 5:
         Neighbor 1   Neighbor 2   Neighbor 3
Number   697          36           45
Class    R5-20        R1+20        R1+20
Instance 5 is correctly classified by its neighbors.

Distance-Weighted kNN
(Taking distance into account when voting)
It is more probable that the new instance belongs to the class
of the closest retrieved neighbor. Possible weight kernels:
[Figure: voting weight vs. distance. The weight decreases from 1 at distance 0 to ω_k at the distance of the k-th neighbor.]
Linear
$$\omega_j = \omega_k + (1 - \omega_k) \frac{D_k - D_j}{D_k}$$
Exponential
$$\omega_j = \omega_k^{D_j / D_k}$$
Gaussian
$$\omega_j = \omega_k^{D_j^2 / D_k^2}$$
ω_k: weight given to the k-th neighbor
D_k: distance to the k-th neighbor
D_j: distance to the j-th neighbor
Example using the exponential distance-weight kernel
Classification of instance 633 with class R5-20

           Neighbor 1   Neighbor 2   Neighbor 3
Number     650          26           43
Class      R5-20        R2+20        R2+20
Distance   0.012        0.015        0.230
Weight     0.92         0.35         0.20

Simple voting: Classification R2+20 (Incorrect)
Weighted voting with the exponential kernel, ω_k = 0.2:
$$\omega_j = \omega_k^{D_j / D_k}$$
Vote R5-20: 0.92
Vote R2+20: 0.35 + 0.20 (shown as 0.65 on the slide)
Weighted voting: Classification R5-20 (Correct)
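The three kernels and the weighted vote can be sketched as follows, assuming the reading in which every kernel gives weight 1 at distance 0 and weight ω_k at the distance D_k of the farthest retrieved neighbour. One caveat on the data: with the exponential kernel and ω_k = 0.2, the slide's weight of 0.35 corresponds to a distance of 0.15, whereas the printed 0.015 would give about 0.90. The sketch therefore uses 0.15 for the second neighbour, which is an assumption.

```python
from collections import defaultdict


def linear_weight(d_j, d_k, w_k):
    return w_k + (1.0 - w_k) * (d_k - d_j) / d_k


def exponential_weight(d_j, d_k, w_k):
    return w_k ** (d_j / d_k)


def gaussian_weight(d_j, d_k, w_k):
    return w_k ** (d_j ** 2 / d_k ** 2)


def weighted_vote(neighbours, kernel, w_k):
    """neighbours: list of (distance, class). Returns (winning class, votes per class)."""
    d_k = max(d for d, _ in neighbours)   # distance to the farthest (k-th) neighbour
    votes = defaultdict(float)
    for d_j, label in neighbours:
        votes[label] += kernel(d_j, d_k, w_k)
    return max(votes, key=votes.get), dict(votes)


# Second distance taken as 0.15 (assumption, see the paragraph above).
neighbours = [(0.012, "R5-20"), (0.15, "R2+20"), (0.23, "R2+20")]
print(weighted_vote(neighbours, exponential_weight, w_k=0.2))
```

The closest neighbour alone outweighs the two farther ones, so the weighted vote returns R5-20 where simple voting returns R2+20.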
1.4 Reduction Techniques for IBL
Storing too many instances can result in large memory requirements
and slow execution speed, and can cause an oversensitivity to noise.
The basic Nearest Neighbour algorithm retains all of the training
instances. It learns very quickly because it need only read in the
training set without much further processing, and it generalizes
accurately for many applications. However, since the basic NN
algorithm stores all of the training instances, it has relatively large
memory requirements. It must search through all available
instances to classify a new input vector, so it is slow during
classification.
Reducing the number of cases in memory can help. The reduction
has to eliminate instances while preserving the classification
performance.
Reduction algorithms can be characterized by:
- Points to retain: central points or border points
- Search direction: incremental or decremental
- Used metric: Euclidean, Clark, Manhattan, HVDM
Examples: DROP4, IB3
Reduction Techniques for IBL
Reduction Technique DROP4 (R. Wilson, T. Martinez)
(Decremental Reduction Optimization Procedure)

Decremental: it begins with the entire set T and removes
unnecessary instances. S is the resulting reduced set.

For each instance s_i it considers:
- How instance s_i is classified by the others.
- How the others are classified without the instance s_i.

Classification: given by the class of the K nearest instances.
Reduction Technique DROP4 (cont.)

Example with K=2 neighbors
[Figure: Case1 to Case5 plotted in the plane Attribute 1 vs. Attribute 2.]
Case Num   Neighbor 1   Neighbor 2
1          2            3
2          1            4
3          5            4
4          3            5
5          3            4
Associate: case a is an associate of case b if case a has case b as a neighbor.
Case Num   Associate 1   Associate 2   Associate 3
1          2
2          1
3          1             4             5
4          2             3             5
5          3             4
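The associates table can be derived mechanically from the neighbor table by inverting it: each case is added to the associate list of every case it has as a neighbor. A minimal sketch (the function name is hypothetical):

```python
def associates(neighbours):
    """Invert a neighbour table: a is an associate of b when b is among a's neighbours."""
    result = {case: [] for case in neighbours}
    for case, its_neighbours in neighbours.items():
        for n in its_neighbours:
            result[n].append(case)
    return result


# Neighbour table of the K=2 example.
NEIGHBOURS = {1: [2, 3], 2: [1, 4], 3: [5, 4], 4: [3, 5], 5: [3, 4]}
print(associates(NEIGHBOURS))
```

The result reproduces the associates table of the example, e.g. case 3 has associates 1, 4 and 5.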

Basic Rules:
1- Remove instance s_i if it is correctly classified by its neighbors.

2- Remove instance s_i from S if at least as many of its associates in T would be classified correctly without s_i.

Example 1: Biquadratic filter with K=3 neighbors

Instance     SP     t_d    t_r    V_est     Class
Instance 1   3.97   18 s   77 s   -0.8716   R1+20

Neighbors of Instance 1:
Number   37      11      19
Class    R1+20   R1+20   R1+20
Case correctly classified by its neighbors. Rule 1 satisfied.

Analysis dropping Instance 1
Neighbor 4 is the next neighbor, which takes the place of Instance 1 once it is dropped.
Associate    Neighbor 1    Neighbor 2   Neighbor 3    Neighbor 4    Diagnosis without Instance 1
45 (R1+20)   1 (R1+20)     29 (R1+20)   119 (R2+20)   37 (R1+20)    R1+20, Correct
37 (R1+20)   697 (R5-20)   29 (R1+20)   1 (R1+20)     45 (R1+20)    R1+20, Correct
29 (R1+20)   45 (R1+20)    1 (R1+20)    28 (R1+20)    124 (R2+20)   R1+20, Correct
11 (R1+20)   2 (R1+20)     45 (R1+20)   1 (R1+20)     29 (R1+20)    R1+20, Correct

The classification of the associates of Instance 1 is not affected by its dropping. Rule 2 SATISFIED.
INSTANCE 1 DROPPED
Example 2: Biquadratic filter with K=3 neighbors

Instance     SP     t_d    t_r    V_est     Class
Instance 5   4.33   17 s   76 s   -0.8811   R1+20

Neighbors of Instance 5:
Number   697     36      45
Class    R5-20   R1+20   R1+20
Instance 5 correctly classified by its neighbors. Rule 1 satisfied.

Analysis dropping instance 5
Neighbor 4 is the next neighbor, which takes the place of Instance 5 once it is dropped.
Associate     Neighbor 1    Neighbor 2    Neighbor 3    Neighbor 4    Diagnosis with Instance 5   Diagnosis without Instance 5
697 (R5-20)   687 (R5-20)   5 (R1+20)     690 (R5-20)   45 (R1+20)    R5-20, Correct              R5-20, Correct
42 (R1+20)    5 (R1+20)     680 (R5-20)   32 (R1+20)    687 (R5-20)   R1+20, Correct              R5-20, Wrong

The classification of the associates of Instance 5 is AFFECTED by its dropping. Rule 2 NOT satisfied.
INSTANCE 5 NOT DROPPED
Reduction Technique IB3 (D. Aha, D. Kibler and M. Albert)
(Instance Based Learning Algorithm 3)

Incremental: it begins with an empty set S.

It maintains a classification record for each stored instance s_i. This record
indicates how well the instance performs when classifying instances of
its same class.
Basic Rules:
1- If the instance record has a figure higher than a certain pre-established
limit, it is accepted and used for classifying the subsequent instances.

2- If it is lower than a certain limit, the instance is believed to be noisy and
it is dropped from the base S.

3- If it lies between the two, it is not used for prediction but its record is
updated.
IB3 algorithm
For each instance t in T (training set)
    Let a be the nearest acceptable instance in S to t
    (if there are no acceptable instances in S, let a be a random instance in S)
    If class(a) ≠ class(t) then add t to S
    For each instance s in S
        If s is at least as close to t as a is
        Then update the classification record of s and remove
             s from S if its classification record is significantly poor.
Remove all non-acceptable instances from S

An instance is acceptable if the lower bound on its accuracy is statistically
significantly higher (at a 90% confidence level) than the upper bound on the
frequency of its class.
An instance is dropped from S if the upper bound on its accuracy is statistically
significantly lower (at a 70% confidence level) than the lower bound on the frequency
of its class.
Other instances are kept in S during training and then dropped at the end if they do
not prove to be acceptable.
Reduction Technique IB3 (Cont.)

The limits are calculated as the upper and lower bounds of the confidence interval of a
Bernoulli process whose mean is the true probability.
Lower bound (minus sign) and upper bound (plus sign):
$$\frac{p + \frac{z^2}{2n} \pm z \sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$
For the accuracy limits:
n: number of classification attempts.
p: accuracy of such attempts (number of correct matches / n).
z: confidence index.
For the limits of the frequency of a class:
p: frequency of instances of that class.
n: total number of instances previously processed.
z: confidence index.
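The bounds are easy to evaluate numerically. The sketch below assumes the usual form of the interval, (p + z²/2n ± z·sqrt(p(1-p)/n + z²/4n²)) / (1 + z²/n), and wraps the acceptance and rejection tests around it; the function names are hypothetical. For an instance with 80 correct matches in 100 attempts and z = 0.9 it gives about 76.17 % and 83.35 %, and for a class with 120 of 200 instances about 56.85 % and 63.07 %, the figures of the accepted example in this section.

```python
import math


def confidence_bounds(p, n, z):
    """Lower and upper bound for a proportion p observed over n trials."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator


def is_acceptable(correct, attempts, class_count, processed, z=0.9):
    """Accept when the accuracy lower bound exceeds the class-frequency upper bound."""
    lower_s, _ = confidence_bounds(correct / attempts, attempts, z)
    _, upper_c = confidence_bounds(class_count / processed, processed, z)
    return lower_s > upper_c


def is_rejected(correct, attempts, class_count, processed, z=0.7):
    """Drop when the accuracy upper bound is below the class-frequency lower bound."""
    _, upper_s = confidence_bounds(correct / attempts, attempts, z)
    lower_c, _ = confidence_bounds(class_count / processed, processed, z)
    return upper_s < lower_c


print(confidence_bounds(0.8, 100, 0.9))   # accuracy bounds of the instance
print(confidence_bounds(0.6, 200, 0.9))   # frequency bounds of its class
```

An instance that is neither acceptable nor rejected stays in S without being used for prediction, as stated in the basic rules.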
[Figure: probability distribution of the success of instance s_i, with bounds L_Si and U_Si, and of the frequency of instances of class C_i, with bounds L_Ci and U_Ci.]
Accepted: L_Si > U_Ci (probability of success > probability of being of class C_i). Confidence index z = 0.9.
Rejected: U_Si < L_Ci (probability of success < probability of being of class C_i). Confidence index z = 0.7.
Example: instance s_i accepted
N = 200 instances processed (s_1, s_2, s_3, ..., s_i, ..., s_N), of which N_C = 120 are of the same class C_i as s_i.
Instance s_i: n = 100 interventions, C = 80 correct.
Confidence index z = 0.9
Accuracy of s_i:        U_Si = 83.35 %   L_Si = 76.17 %
Frequency of class C_i: U_Ci = 63.07 %   L_Ci = 56.85 %
U_Ci = 63.07 % < L_Si = 76.17 %, so s_i is accepted.
Example: instance s_i rejected
N = 1000 instances processed (s_1, s_2, s_3, ..., s_i, ..., s_N), of which N_C = 700 are of the same class C_i as s_i.
Instance s_i: n = 900 interventions, C = 200 correct.
Confidence index z = 0.7
Accuracy of s_i:        U_Si = 23.49 %   L_Si = 21 %
Frequency of class C_i: U_Ci = 71.29 %   L_Ci = 68.68 %
U_Si = 23.49 % < L_Ci = 68.68 %, so s_i is rejected.
Sets comparison for the biquad filter (success % per fault)
Fault Classic Spread DROP4 IB3
R1+20 84 83 85 77
R1-20 87 94 90 90
R1+50 99 98 99 97
R1-50 100 100 100 100
R2+20,R3+20,C1+20 41 30 41 35
R2-20,R3-20,C1-20 36 35 35 31
R2+50,R3+50,C1+50 79 78 83 72
R2-50,R3-50,C1-50 94 99 96 98
R4+20 85 89 87 78
R4-20 88 84 88 80
R4+50 98 98 98 98
R4-50 100 100 100 100
R5+20 47 33 46 36
R5-20 38 40 38 43
R5+50 82 78 83 75
R5-50 91 93 94 93
R6+20 79 78 79 78
R6-20 86 83 82 74
R6+50 98 99 100 100
R6-50 100 100 100 100
C2+20 82 75 74 76
C2-20 89 90 90 83
C2+50 99 99 99 95
C2-50 100 100 100 100
NOM 69 61 72 57
Average 82.04 80.68 82.36 78.64
Size 25 12500 1112 (8.8%) 2457 (19.6%)
2.1 Introduction
2.2 Case Representation
2.3 Retrieval
2.4 Reuse
2.5 Revise
2.6 Retain
2.7 Training
2.8 Performance measurements
2.8.1 Training and test
2.8.2 Cross Validation
2.8.3 Confusion Matrix
2.8.4 ROC analysis
2- Case Based Reasoning
It is based on the idea that similar problems have similar solutions. A
new problem is solved by matching it with a similar past situation.
Advantages
- It is easy to obtain rules
- It is quite intuitive for diagnosis
- It tolerates lazy-learning schemes
Drawbacks
- Utility problem
- The order in which cases are selected during training is very important
- It is necessary to define good training and maintenance policies
2.1 Introduction
The CBR Cycle consists of:
RETRIEVE the most similar case or cases
REUSE the information and knowledge in that case to solve the problem
REVISE the proposed solution
RETAIN the parts of this experience likely to be useful for future problem solving

[Figure: the CBR cycle. INPUT: new problem. 1. Retrieve: retrieved cases from the CASE BASE. 2. Reuse: retrieved solution. 3. Revise: revised solution (OUTPUT). 4. Retain: learned case stored in the case base. GENERAL KNOWLEDGE supports all four steps.]
[Figure: CBR main task and subtasks.]
2.2 Case base representation
Decide what to store in a case
Finding an appropriate structure for describing case contents
Deciding how the case memory should be organized and indexed
for effective retrieval and reuse
Flat case structure
Each case is a row; the columns hold the attributes (characteristics) and the solution.
Columns, in order: individual code; year the family was included in the study (2 digits); proband / non-proband; country of residence; are you the proband of your family?; family type; date of birth; city; postal code; province of residence (code); affected by breast/ovarian neoplasm?
(Empty cells are omitted in some rows.)
1 95 0 108 0 22/03/32 LLEIDA 25003 25 1
2 95 0 108 0 12/09/28 BARCELONA 08 1
3 95 1 108 1 1 02/05/58 LLEIDA 25006 25 1
4 95 0 108 0 0 03/11/64 LLEIDA 25003 08 0
5 95 0 108 0 26/07/36 BARCELONA 08 1
6 95 1 108 1 2 26/03/51 BARCELONA 08 1
7 95 0 108 0 23/07/19 BARCELONA 08 1
Characteristics (content)
Problem description
- Goals to be achieved
- Constraints on the goals
- Other descriptive information (initial data)
Solutions
- The solution itself
- The steps of the reasoning process (trace)
- Justifications of the decisions
- Alternative solutions
- Expectations
Outcome feedback (if any)
- Success or failure of the solution
- Expectations met or violated
- Explanation of the failure
- Explanation of the anomalies, explanation of the repair strategies, next case, ...
Indices: vocabulary
Purpose: retrieve the most useful cases
Depends on the task and on the characteristics of the domain
Indices can come from
- Observed characteristics
- Derived (inferred) characteristics
Good indices are
- Predictive
- Discriminating
- Abstract at the right level
Indices: selection
Indices must be
- predictive (e.g. categories)
- discriminating (e.g. values)
- explanatory
What should be indexed
- Solutions
- Correct outcomes
- Erroneous outcomes
Memory organizations
- Flat organization, serial search
- Hierarchical organization, shared-feature networks
- Discrimination networks with priorities
- Discrimination networks with redundancies
- Flat organization, parallel search
- Hierarchical organization, parallel search
Flat organization, serial search
Cases are stored in a list, table or file.

Search: all the cases are traversed.

Example: match of each stored case against the input
Input: Large, Orange, Prism
Case                   Size   Color     Shape
Small, Red, Sphere     no     partial   no
Large, Red, Sphere     yes    partial   no
Large, Red, Pyramid    yes    partial   no
Large, Blue, Prism     yes    no        yes
Output:
Flat organization ...
Advantages
- It returns the most similar case or set of cases
- Adding a case is simple
Drawbacks
- Retrieval is expensive
Variations
- Shallow indexing
- Partitioning of the case library
- Parallel search and comparison
Hierarchical organization, shared-feature networks
- Each case is a node of a tree
- The graph subdivides the space according to the features shared by the cases
- Thresholds can be set on the feature values.
- Retrieval: the tree is traversed breadth-first, following the nodes (clusters of cases) with the highest similarity.
Example
[Figure: shared-feature tree built over the cases (Small, Red, Sphere), (Large, Red, Sphere), (Large, Red, Pyramid) and (Large, Blue, Prism).]
Input: Large, Orange, Prism
Output: Large, Red, Pyramid
Considerations
The hierarchy of the nodes must correspond to the importance of the features.

In the example, if SHAPE is more important than COLOR:
[Figure: the same cases arranged in a tree that discriminates first on shape.]
Hierarchical organization...
Advantages
- More efficient

Drawbacks
- Adding a case is more complex
- Optimal maintenance is costly
- Space is needed for the organization
- There is no guarantee that a better case is not missed
2.3 Case Retrieval
Three main issues:
- Similarity function
- Matching
- Ranking
Retrieval starts with a partial problem description and ends when a best-matching previous
case is found.
Similarity function
Between features (local)
- Total or exact similarity
- Partial similarity
Between cases (global)
- Total or exact similarity
- Partial similarity
Matching
- Structural
- Semantic
- Organizational
- Pragmatic
Example of a similarity defined with a table: computer use
             Domestic   Industrial   Space
Domestic     1          0.4          0
Industrial   0.8        1            0.2
Space        0.6        0.8          1
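A table-defined local similarity is simply a lookup. The sketch below stores the table with English labels for the three categories (Domestic, Industrial, Space) and combines local similarities into a global one with a weighted average. Two things are assumptions here: reading the rows as the query value and the columns as the stored-case value, and the weighted average itself, which is only one possible global measure.

```python
# Similarity table for the attribute "computer use"; rows = query value,
# columns = stored-case value (assumed orientation).
COMPUTER_USE_SIM = {
    "Domestic":   {"Domestic": 1.0, "Industrial": 0.4, "Space": 0.0},
    "Industrial": {"Domestic": 0.8, "Industrial": 1.0, "Space": 0.2},
    "Space":      {"Domestic": 0.6, "Industrial": 0.8, "Space": 1.0},
}


def local_similarity(table, query_value, case_value):
    """Similarity between two values of one attribute, read from the table."""
    return table[query_value][case_value]


def global_similarity(local_sims, weights):
    """Weighted average of local similarities (one possible global measure)."""
    return sum(w * s for w, s in zip(weights, local_sims)) / sum(weights)


s = local_similarity(COMPUTER_USE_SIM, "Industrial", "Domestic")
print(s, global_similarity([s, 1.0], weights=[2.0, 1.0]))
```

Note that the table is not symmetric, so the order of the two arguments matters.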
Ranking
- Discard the cases with a low score
- Sort the cases
- Weigh the features that always appear together (context)
- Trade off the number of present and absent features
- Weigh the inferential cost of the adaptation
- Consider the more specific cases before the more general ones
- Prefer the most frequent cases
- Prefer the most recent cases
The process of selecting the best case must be fast.

Otherwise the efficiency of case-based reasoning is lost.
2.4 Case Reuse
Copy
Adapt

The reuse of a retrieved case solution in the context of the new case
focuses on two aspects:

- Difference between past and current case

- What part of retrieved case can be transferred to new
solution
2 main approaches:
Copy: a simple classification where differences are abstracted away
and the solution class of the retrieved case is transferred to the new
case as its solution class

Adapt:
- Reuse the past solution (transformational reuse): uses
transformational operators {T} to transform the old solution
into a solution for the new case, according to the differences between both cases.

- Reuse the past method that constructed the solution
(derivational reuse): the retrieved case holds information about the
method used for solving the old problem, and the old plan is
replayed in the new context. It looks at the justification of operators,
subgoals, alternatives and failed search paths.
ADAPTATION
Goal:
- Adjust a non-exact solution to fit the current problem
- Repair an erroneous solution
What to adapt: values, structures
Methods
- Substitution
- Transformation
- Others
Example
Case in memory
Stir-fry
Beef and broccoli
Cut the broccoli into pieces, shred the beef, marinate the beef in ....

Current case
Stir-fry
Chicken and peas
?
Example...
[Figure: ingredient hierarchy]
Proteins: Meat (Red: Beef; Poultry: Chicken), Eggs, Seafood (Fish, Crustaceans)
Vegetables: Yellow, Green (Broccoli, Peas)
Example
Case in memory
Stir-fry
Beef and broccoli
Cut the broccoli into pieces, shred the beef, marinate the beef in ....

Current case
Stir-fry
Chicken and peas
Cut the peas into pieces, shred the chicken, marinate the chicken in ....
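The substitution step of this example can be imitated with a small taxonomy lookup: each ingredient of the stored solution is replaced by the ingredient of the new case that falls under the same top-level category. This is only a sketch of the idea; the English ingredient names, the dictionary layout and the function names are assumptions.

```python
# Ingredient taxonomy from the example: child -> parent.
PARENT = {
    "beef": "red meat", "chicken": "poultry",
    "red meat": "meat", "poultry": "meat",
    "meat": "proteins", "eggs": "proteins", "seafood": "proteins",
    "broccoli": "green", "peas": "green",
    "green": "vegetables", "yellow": "vegetables",
}


def top_category(item):
    """Climb the taxonomy until the root category is reached."""
    while item in PARENT:
        item = PARENT[item]
    return item


def adapt_by_substitution(solution, old_items, new_items):
    """Replace each old ingredient by the new ingredient of the same top category."""
    for old in old_items:
        new = next(n for n in new_items if top_category(n) == top_category(old))
        solution = solution.replace(old, new)
    return solution


memory_solution = "Cut the broccoli into pieces, shred the beef, marinate the beef in ..."
print(adapt_by_substitution(memory_solution, ["beef", "broccoli"], ["chicken", "peas"]))
```

As in the slide, a blind substitution keeps steps that may not suit the new ingredients (peas are not cut into pieces), which is where the revise phase comes in.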
Voting (majority rule)
The class with the most votes wins.
Example with K=6 neighbors (1: Cancer; 2: Healthy)
Similarity   Class
0.623069     1
0.579494     1
0.567825     2
0.615917     2
0.411370     1
0.507618     1
Voting result: 4 Cancer, 2 Healthy. Diagnosis: Cancer.
Bilska-Wolak and Floyd Method
The variable DV_bw is calculated as
$$DV_{bw} = \frac{\text{number of "+" cases in } \{C_1, C_2, \ldots, C_K\}}{K}$$
A threshold level t is defined:
If DV_bw < t: Healthy
If DV_bw ≥ t: Cancer

Example. Classifying a case C_i
Retrieved cases: C_1+, C_2+, C_3-, C_4-, with similarities 0.8, 0.6, 0.5 and 0.4
$$DV_{bw} = \frac{2}{4} = 0.5$$
Threshold t = 0.5, so DV_bw ≥ t: Cancer
This decision is independent of the similarity degree of the cases.

Example with a threshold t = 0.5
Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV_bw   Diagnostic T
0.5           0.5           0.5           0.5           0.5     +
0.8           0.8           0.5           0.5           0.5     +
0.4           0.4           0.5           0.5           0.5     +
0.4           0.5           0.4           0.5           0.5     +
In every row DV_bw = 2/4 = 0.5 ≥ t: Cancer.
Other method
It takes into account the value of the similarity when calculating the
decision variable. The variable DV is obtained as
$$DV = \frac{\sum_{C_i^+} Sim(C_i^+, T)}{\sum_{i=1}^{K} Sim(C_i, T)}$$
A threshold level t is defined:
If DV < t: Healthy
If DV ≥ t: Cancer

Example. Classifying a case C_i
Retrieved cases: C_1+, C_2+, C_3-, C_4-, with similarities 0.8, 0.6, 0.5 and 0.4
$$DV = \frac{0.8 + 0.6}{0.8 + 0.6 + 0.5 + 0.4} = \frac{1.4}{2.3} = 0.6$$
Threshold t = 0.5, so DV ≥ t: Cancer

Example with a threshold t = 0.5
$$DV = \frac{0.5 + 0.5}{0.5 + 0.5 + 0.5 + 0.5} = 0.5 \geq t \quad \text{Cancer}$$
$$DV = \frac{0.8 + 0.8}{0.8 + 0.8 + 0.5 + 0.5} = 0.61 \geq t \quad \text{Cancer}$$
$$DV = \frac{0.4 + 0.4}{0.4 + 0.4 + 0.5 + 0.5} = 0.44 < t \quad \text{Healthy}$$
$$DV = \frac{0.4 + 0.5}{0.4 + 0.5 + 0.4 + 0.5} = 0.5 \geq t \quad \text{Cancer}$$

Other method - Bilska-Wolak comparison
Our method
Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV     Diagnostic T
0.5           0.5           0.5           0.5           0.50   +
0.8           0.8           0.5           0.5           0.61   +
0.4           0.4           0.5           0.5           0.44   -
0.4           0.5           0.4           0.5           0.50   +
Bilska-Wolak
Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV_bw   Diagnostic T
0.5           0.5           0.5           0.5           0.5     +
0.8           0.8           0.5           0.5           0.5     +
0.4           0.4           0.5           0.5           0.5     +
0.4           0.5           0.4           0.5           0.5     +
Weighting by similarity helps to discriminate.
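Both decision variables fit in a few lines, which makes the comparison easy to reproduce. In this sketch a retrieved case is a pair (similarity, is_positive); the function names and that representation are assumptions. For the second row the weighted variable evaluates to about 0.615, which the slide shows as 0.61.

```python
def dv_bilska_wolak(retrieved):
    """Fraction of positive cases among the K retrieved cases."""
    return sum(1 for _, positive in retrieved if positive) / len(retrieved)


def dv_similarity_weighted(retrieved):
    """Similarity of the positive cases divided by the similarity of all retrieved cases."""
    total = sum(sim for sim, _ in retrieved)
    return sum(sim for sim, positive in retrieved if positive) / total


def diagnose(dv, t=0.5):
    return "Cancer" if dv >= t else "Healthy"


# Rows of the comparison table: similarities of C1+, C2+, C3-, C4- to the test case T.
rows = [(0.5, 0.5, 0.5, 0.5), (0.8, 0.8, 0.5, 0.5), (0.4, 0.4, 0.5, 0.5), (0.4, 0.5, 0.4, 0.5)]
for sims in rows:
    retrieved = list(zip(sims, (True, True, False, False)))
    dv = dv_similarity_weighted(retrieved)
    print(sims, dv_bilska_wolak(retrieved), round(dv, 3), diagnose(dv))
```

The count-based variable is 0.5 for every row, while the similarity-weighted one separates the second and third rows.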
2.5 Case Revision
Occurs when a case solution generated by the reuse phase is not correct
An opportunity for learning occurs
Consists of two tasks
(1) evaluate the case solution generated by reuse. If successful, learn
from the success
(2) otherwise repair the case solution using domain-specific knowledge
Evaluate solution
- steps outside the CBR system
- requires asking the expert or performing the task in the real world.
- example: the success or failure of a medical treatment
Repair fault
- Involves detecting the errors of the current solution and retrieving
or generating explanations for them.
- Errors may be predicted, handled and avoided
- The revised plan can then be retained directly, or re-evaluated and
repaired again

2.6 Case Retention
Process of incorporating what is useful to retain from the new problem solving
episode into existing knowledge

Extract
-case base is updated regardless of how problem was solved
-failures, or information from the revise task, may be saved as well
-relevant problem descriptors, problem solutions, explanations, or
justifications can all be saved and reused later

Index
- indexing problem is a major problem in case-based reasoning
- How do we structure the search space of indexes?
- trivially we can simply use all input features as indices (the approach of
syntax-based methods within instance-based and memory-based
reasoning)

Integrate
- Integration of the new case knowledge into the existing case-base.
Example. IB-like algorithm
Eliminate cases that:
- Diagnose incorrectly: L_si < U_ci
- Have no intervention in the decision of cases of their same class.
[Figure: two plots of L_si and U_ci (probability) against the number of interventions.]
For z = 0.9, the cases are marked for removal sooner (lower N_cross) than using z = 0.3.
2.7 Training
Two different usages of CBR:
- Training. Supervised learning: when training, the solution is known.
- Diagnosing (new cases): the solution is unknown.
[Figure: a new case i, described by SP_i, t_di, t_ri and V_esti, enters the CBR system.]
[Figure: CBR system performance while training. Two panels plotted against the number of trainings: percentage of CORRECT diagnoses, and NUMBER of CASES.]
2.8 Performance measurements
2.8.1 Training and test
A set of memory cases and a set of test cases must be available. If there are no new cases apart from those used for training, one subset of them must be used for training and the other for testing the CBR system.
Normally, however, it must be checked that the system works correctly for different test cases. The success rate given by the system may turn out very poor by chance, because of how the test cases happened to be chosen, or, for the same reason, it may turn out very good by chance.
The result can be biased.
2.8.2 Cross-validation
Divide the set of available cases into M subsets. One of the subsets is used as the test set, and the other M-1 are used to train the system. The process is repeated using each of the subsets as the test set and the rest for training.
Drawbacks:

- High computation time

- If the case base has few data, setting some of them aside for the test means losing training cases.
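The splitting procedure can be sketched as a generator that walks over M contiguous subsets and yields each one once as the test set. The function names and the `evaluate` callback (which would train the CBR system on one list and score it on the other) are hypothetical.

```python
def m_fold_splits(cases, m):
    """Yield (training, test) pairs; each of the m contiguous subsets is the test set once."""
    if not 1 < m <= len(cases):
        raise ValueError("m must be between 2 and the number of cases")
    base, extra = divmod(len(cases), m)
    start = 0
    for fold in range(m):
        size = base + (1 if fold < extra else 0)   # spread the remainder over the first folds
        test = cases[start:start + size]
        training = cases[:start] + cases[start + size:]
        yield training, test
        start += size


def cross_validate(cases, m, evaluate):
    """Average of evaluate(training, test) over the m folds."""
    scores = [evaluate(training, test) for training, test in m_fold_splits(cases, m)]
    return sum(scores) / len(scores)


cases = list(range(10))
for training, test in m_fold_splits(cases, 3):
    print(test, training)
```

Because the subsets are contiguous, the cases should be shuffled beforehand if their order carries information, for example when they are grouped by fault.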
[Figure: the available cases are split into n subsets of L cases each: S_1 = Case 1 to Case L, S_2 = Case L+1 to Case 2L, S_3 = Case 2L+1 to Case 3L, ..., S_n = Case (n-1)L+1 to Case nL.]
The training process
It is sensitive to how the training cases are sorted.
[Figure: training sets T_1, T_2, ..., T_m, each one formed by a different combination of the subsets S_1, S_2, S_3, ..., S_n.]
Example of training with multiple combinations of the training sets
[Figure: six panels plotted against the number of trainings, with training 132 marked: CORRECT with PRECISION (%), WRONG (%), TOTAL CORRECT (%), CORRECT at COMPONENT level (%), CORRECT at MODULE level (%) and NUMBER OF CASES.]
2.8.3 Confusion Matrix
              Real +   Real -
Estimated +   tp       fp
Estimated -   fn       tn
P: number of positive cases (e.g. ill)
N: number of negative cases (e.g. healthy)
TP: number of cases correctly diagnosed as positive
TN: number of cases correctly diagnosed as negative
FP: number of cases incorrectly diagnosed as positive
FN: number of cases incorrectly diagnosed as negative
$$tp = \frac{TP}{P + N} \qquad tn = \frac{TN}{P + N} \qquad fn = \frac{FN}{P + N} \qquad fp = \frac{FP}{P + N}$$
Lower case refers to the rate, e.g. tp: True Positive Rate.
Example of a confusion matrix
Case     Estimated   Real
Case1    1           1
Case2    1           1
Case3    2           1
Case4    2           2
Case5    1           1
Case6    1           2
Case7    2           1
Case8    2           2
Case9    1           1
Case10   2           2
Positive: 1
P = 6
N = 4
TP: 4
TN: 3
FP: 1
FN: 2
tp = 0.4   tn = 0.3
fp = 0.1   fn = 0.2
              Real +   Real -
Estimated +   0.4      0.1
Estimated -   0.2      0.3
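The counts of the example can be checked by tallying the ten cases directly. The sketch below follows this section's convention of dividing every count by P + N to obtain the lower-case rates; the function names are hypothetical. Counting the rows gives TP = 4, TN = 3, FP = 1 (Cas6) and FN = 2 (Cas3 and Cas7), which add up to the ten cases.

```python
# (estimated, real) for Cas1..Cas10; class 1 is the positive class.
CASES = [(1, 1), (1, 1), (2, 1), (2, 2), (1, 1), (1, 2), (2, 1), (2, 2), (1, 1), (2, 2)]


def confusion_counts(cases, positive=1):
    """Tally TP, TN, FP and FN from (estimated, real) pairs."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for estimated, real in cases:
        if estimated == positive:
            counts["TP" if real == positive else "FP"] += 1
        else:
            counts["FN" if real == positive else "TN"] += 1
    return counts


def rates(counts):
    """Lower-case rates as defined in this section: each count divided by P + N."""
    total = sum(counts.values())
    return {name.lower(): value / total for name, value in counts.items()}


counts = confusion_counts(CASES)
print(counts, rates(counts))
```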
2.8.4 ROC analysis
ROC (Receiver Operating Characteristics). It plots the relationship between tp and fp of a diagnosis system as a function of a decision parameter. The procedure followed to draw the curves was:
1. The CBR method is applied to all the test cases.

2. The number of ill cases correctly diagnosed (TP) and the number of healthy cases diagnosed as ill (FP) are counted. From these, the true positive (tp) and false positive (fp) rates are calculated.

3. This calculation is repeated for different thresholds t, varying t from 0 to 1, which finally gives the ROC curve.
Example. Diagnosis of diabetes
The algorithm followed was:

1. Given a new patient (test case).

2. The similarity of the test case to each of the cases in memory was calculated.

3. The n cases in memory most similar to the test case were retrieved.

4. The following coefficient was calculated:
$$\frac{\sum_{j} sim(p_j)}{\sum_{i} sim(p_i)}$$
where p_j are the retrieved cases that have diabetes and p_i are all the retrieved cases.

5. A decision threshold t is set. If the coefficient is greater than t, the test case is considered diabetic. Otherwise, it is considered healthy.
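The threshold sweep of the procedure can be sketched as follows. Each test case is reduced to the pair (coefficient, is_ill), and one point of the curve is computed per threshold. The sketch normalizes with the conventional rates, TP / P and FP / N, so that the ideal point is [0, 1]; that normalization, the toy data and the function name are assumptions.

```python
def roc_points(scored_cases, thresholds):
    """scored_cases: list of (coefficient, is_ill). Returns one (fp rate, tp rate) per threshold."""
    positives = sum(1 for _, ill in scored_cases if ill)
    negatives = len(scored_cases) - positives
    points = []
    for t in thresholds:
        # A case is diagnosed as ill when its coefficient exceeds the threshold.
        tp = sum(1 for score, ill in scored_cases if score > t and ill)
        fp = sum(1 for score, ill in scored_cases if score > t and not ill)
        points.append((fp / negatives, tp / positives))
    return points


# Toy test set (illustrative values only).
scored = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
thresholds = [i / 10 for i in range(11)]   # t from 0 to 1
for t, (fpr, tpr) in zip(thresholds, roc_points(scored, thresholds)):
    print(f"t={t:.1f}  fp rate={fpr:.2f}  tp rate={tpr:.2f}")
```

Low thresholds diagnose everybody as ill (top right corner of the plot) and high thresholds nobody (bottom left corner); the useful thresholds lie in between.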
[Figure: ROC curves (True Positive Rate vs. False Positive Rate, both from 0 to 1) for n = 3, 5, 7, 9, 11 and 13 retrieved cases.]
The curve helps to choose the most suitable threshold value. The expert is the one who decides the number of FP and TP tolerated, depending on the application.
A good diagnosis method gives ROC curves that lie above the diagonal and pass as close as possible to the point [0,1] (the ideal result).
