
Similarity and Dissimilarity

Lect 09/10-08-09 1
• Similarity: a numerical measure of the degree to which two objects are alike.
• The value of a similarity measure is high if the two objects are very similar.
• If the scale is [0, 1], then 0 indicates no similarity and 1 indicates complete similarity.

• Dissimilarity:
– a numerical measure of the degree to which two objects are different.
– Dissimilarity is lower for similar pairs of objects.
• Transformations:
– Suppose similarity between two objects ranges from 1 (not at all similar) to 10 (completely similar).
– It can be transformed into the range [0, 1] as follows:
• S' = (S − 1)/9, where S and S' are the original and transformed similarity values.

• It can also be generalized as:
– S' = (s − min_s)/(max_s − min_s)
– Similarly, dissimilarity can be mapped onto the interval [0, 1]:
– D' = (d − min_d)/(max_d − min_d)
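The min-max transformation above can be sketched in a few lines of Python (the function name is illustrative):

```python
def min_max_transform(values):
    """Map similarity (or dissimilarity) values onto [0, 1] via min-max scaling."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Similarities on a 1 (not at all similar) to 10 (completely similar) scale;
# with min_s = 1 and max_s = 10 this is equivalent to S' = (S - 1)/9.
s = [1, 5.5, 10]
s_prime = min_max_transform(s)  # [0.0, 0.5, 1.0]
```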

Similarity & Dissimilarity Between Simple Attributes
• Consider objects having a single attribute
– One nominal attribute
• In the context of such an attribute, similarity means whether the two objects have the same value or not:
S = 1 if the attribute values match
S = 0 if the attribute values do not match
Dissimilarity can be defined similarly, in the opposite way.
• One ordinal attribute
– Order is important and should be taken into account.
– Suppose an attribute measures the quality of a product as follows: {poor, fair, OK, good, wonderful}
• Product P1 = wonderful, Product P2 = good, P3 = OK, and so on
• To make this observation quantitative, say {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}
• Then d(P1, P2) = 4 − 3 = 1
• Or, normalized, d(P1, P2) = (4 − 3)/4 = 0.25
• A similarity for ordinal attributes can be defined as s = 1 − d
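The ordinal example above can be sketched as follows (the mapping dictionary and function name are illustrative):

```python
# Map the ordinal quality levels to integers, as on the slide.
levels = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def ordinal_dissimilarity(a, b):
    """Normalized ordinal dissimilarity: |a - b| / (n - 1) after mapping to integers."""
    n = len(levels)
    return abs(levels[a] - levels[b]) / (n - 1)

d = ordinal_dissimilarity("wonderful", "good")  # (4 - 3) / 4 = 0.25
s = 1 - d                                       # similarity = 0.75
```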

• For interval or ratio attributes, the measure of dissimilarity between two objects is the absolute difference between their attribute values.

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

Attribute type   Dissimilarity                 Similarity
Nominal          d = 0 if p = q, 1 if p ≠ q    s = 1 if p = q, 0 if p ≠ q
Ordinal          d = |p − q|/(n − 1)           s = 1 − d
                 (values mapped to 0 … n − 1)
Interval/Ratio   d = |p − q|                   s = −d, or s = 1/(1 + d)

Dissimilarity between data objects
• Distances:
• The Euclidean distance between two data objects (points):

d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)^2 )

where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.
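The definition above translates directly into code (a minimal sketch):

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared attribute differences."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

d = euclidean((0, 2), (2, 0))  # sqrt(4 + 4) = 2.828...
```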

Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
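The distance matrix above can be reproduced as a quick check (variable names are illustrative):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Build the 4x4 distance matrix shown on the slide, rounded to 3 decimals.
names = list(points)
matrix = {(a, b): round(euclidean(points[a], points[b]), 3)
          for a in names for b in names}
```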

Minkowski Distance

dist = ( Σ_{k=1..n} |p_k − q_k|^r )^(1/r)

Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the
number of bits that are different between two binary vectors

• r = 2. Euclidean distance

• r → ∞. "supremum" (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors.

• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
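The three cases above can be sketched as (function names are illustrative):

```python
def minkowski(p, q, r):
    """Minkowski distance; r = 1 is city block (L1), r = 2 is Euclidean (L2)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

def supremum(p, q):
    """The r -> infinity limit: the maximum difference over the components."""
    return max(abs(pk - qk) for pk, qk in zip(p, q))

p1, p2 = (0, 2), (2, 0)
l1 = minkowski(p1, p2, 1)   # 4.0
l2 = minkowski(p1, p2, 2)   # 2.828...
linf = supremum(p1, p2)     # 2
```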

Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some
well known properties.

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a Metric
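A small sketch that spot-checks the three properties for the Euclidean distance on the sample points used earlier (variable names are illustrative):

```python
import itertools
import math

def d(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]

# Positive definiteness: non-negative, and zero exactly for identical points.
ok_pos = all(d(p, q) >= 0 and (d(p, q) == 0) == (p == q)
             for p in pts for q in pts)
# Symmetry: d(p, q) = d(q, p).
ok_sym = all(math.isclose(d(p, q), d(q, p)) for p in pts for q in pts)
# Triangle inequality over all triples (small tolerance for rounding).
ok_tri = all(d(p, r) <= d(p, q) + d(q, r) + 1e-12
             for p, q, r in itertools.product(pts, repeat=3))
```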

Example of a Non-Metric Dissimilarity
• Two sets A = {1, 2, 3, 4}, B = {2, 3, 4}
• A − B = {1}, B − A = { }
• Define the distance d between A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set.
• This distance measure does not satisfy positive definiteness (d(A, B) = 0 whenever A is a subset of B, even if A ≠ B) or symmetry (here d(A, B) = 1 but d(B, A) = 0).
• These properties can hold if the dissimilarity measure is modified as follows:
D(A, B) = size(A − B) + size(B − A)
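The asymmetry and its fix can be demonstrated directly (function names are illustrative):

```python
def size_diff(a, b):
    """d(A, B) = size(A - B): elements in A but not in B."""
    return len(a - b)

def sym_diff(a, b):
    """Modified measure D(A, B) = size(A - B) + size(B - A), which is symmetric."""
    return len(a - b) + len(b - a)

A = {1, 2, 3, 4}
B = {2, 3, 4}
d_ab = size_diff(A, B)  # 1
d_ba = size_diff(B, A)  # 0 -> not symmetric, and 0 even though A != B
```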
Common Properties of a
Similarity
• Similarities also have some well known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors
• A common situation is that objects p and q have only n binary attributes.
• p and q then become binary vectors, which leads to the following four quantities.
• Compute similarities using these quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

• Simple Matching Coefficient (a similarity coefficient):
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

SMC versus Jaccard: Example
J = number of 11 matches / number of not-both-zero attributes values
= (M11 ) / (M01 + M10 + M11 )
Example: p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
         q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00 )/(M01 + M10 + M11 + M00 ) = (0+7) / (2+1+0+7) = 0.7
J = (M11 ) / (M01 + M10 + M11 ) = 0 / (2 + 1 + 0) = 0
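The worked example above can be checked in code (the function name is illustrative):

```python
def smc_and_jaccard(p, q):
    """Compute SMC and the Jaccard coefficient from the four match counts."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 1.0
    return smc, jaccard

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
smc, j = smc_and_jaccard(p, q)  # 0.7 and 0.0
```

Note how SMC counts the shared zeros while Jaccard ignores them, which is why the two measures disagree so sharply on sparse binary data.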

Cosine Similarity
• This similarity measure is commonly used for documents.
• Documents are often represented as vectors.
• Each attribute represents the frequency of a particular word.
• Each document may have thousands or tens of thousands of attributes.
• Cosine similarity is one of the most common measures of document similarity.
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
where • indicates vector dot product and || d || is the length of vector d.

• Example: consider two document vectors:

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.315
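The computation above can be sketched as (the function name is illustrative):

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
c = cosine_similarity(d1, d2)  # ~0.315
```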

• If the cosine similarity is 1, the angle between x and y is 0 and x and y are the same except for magnitude; if the cosine similarity is 0, the angle is 90° and the vectors share no common attributes.

(Figure: the angle θ between vectors x and y.)
Extended Jaccard Coefficient (Tanimoto)
• Can be used for document data
• Reduces to Jaccard for binary attributes
• EJ(p, q) = (p • q) / (||p||² + ||q||² − p • q)

Correlation
• Correlation measures the linear relationship between objects.
• Example (perfect correlation): the following two sets of values for x and y are cases where the correlation is −1 and +1, respectively.
• x = (−3, 6, 0, 3, −6) and y = (1, −2, 0, −1, 2) [compute]
• x = (3, 6, 0, 3, 6) and y = (1, 2, 0, 1, 2) [compute]

• To compute correlation, we standardize the data objects p and q, and then take their dot product (divided by n − 1, since std is the sample standard deviation):

p'_k = (p_k − mean(p)) / std(p)
q'_k = (q_k − mean(q)) / std(q)
correlation(p, q) = (p' • q') / (n − 1)
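A sketch of this procedure, standardizing with the sample standard deviation and normalizing by 1/(n − 1) so that perfectly correlated vectors give ±1; it reproduces the two examples from the previous slide:

```python
import math

def correlation(p, q):
    """Standardize p and q, then take their dot product divided by n - 1."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = math.sqrt(sum((x - mp) ** 2 for x in p) / (n - 1))
    sq = math.sqrt(sum((y - mq) ** 2 for y in q) / (n - 1))
    ps = [(x - mp) / sp for x in p]
    qs = [(y - mq) / sq for y in q]
    return sum(a * b for a, b in zip(ps, qs)) / (n - 1)

c_neg = correlation((-3, 6, 0, 3, -6), (1, -2, 0, -1, 2))  # -1.0
c_pos = correlation((3, 6, 0, 3, 6), (1, 2, 0, 1, 2))      # +1.0
```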

Visually Evaluating Correlation
•Scatter plots showing the similarity from –1 to 1.

• It is easy to visualize the correlation between two data objects x and y by plotting pairs of corresponding attribute values. The figure shows a number of such plots, where x and y have 30 attributes. Each circle in a plot represents one of the 30 attributes; its x-coordinate is the value of that attribute for x, while its y-coordinate is the value of the same attribute for y.
Issues in computing proximity
measures
• 1. Handling cases where attributes have different scales and/or are correlated.
• 2. Computing proximity between objects when they have different types of attributes (qualitative/quantitative).
• 3. Handling attributes having different weights, i.e., when all attributes do not contribute equally to the proximity of objects.

Mahalanobis Distance

mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)^T

where Σ is the covariance matrix of the input data X.

Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution is approximately Gaussian.

For the red points (in the figure), the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Mahalanobis Distance
Covariance Matrix:

Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
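The two values above can be verified with a small sketch (in the squared form used on this slide; the 2x2 inverse is written out by hand, and numpy would handle the general case):

```python
def mahalanobis_2d(p, q, cov):
    """Squared-form Mahalanobis distance (p - q) inv(cov) (p - q)^T for 2D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * (dx * inv[0][0] + dy * inv[0][1])
            + dy * (dx * inv[1][0] + dy * inv[1][1]))

cov = ((0.3, 0.2), (0.2, 0.3))
m_ab = mahalanobis_2d((0.5, 0.5), (0, 1), cov)      # 5
m_ac = mahalanobis_2d((0.5, 0.5), (1.5, 1.5), cov)  # 4
```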

Combining Similarities for
heterogeneous attributes
• Sometimes attributes are of many different types,
but an overall similarity is needed.

Using Weights to Combine Similarities
• When some attributes are more important for calculating proximity than others, we may not want to treat all attributes the same.
– Use weights w_k which are between 0 and 1 and sum to 1; the overall similarity then becomes a weighted sum of the per-attribute similarities:
similarity(p, q) = Σ_{k=1..n} w_k s_k(p, q)
– The Minkowski distance becomes:
dist = ( Σ_{k=1..n} w_k |p_k − q_k|^r )^(1/r)
Density
• Density-based clustering requires a notion of density.

• Examples:
– Euclidean density
• Euclidean density = number of points per unit volume

– Probability density

– Graph-based density

Euclidean Density – Cell-based
• The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points a cell contains.
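A minimal sketch of cell-based density in 2D (the function name and cell size are illustrative):

```python
from collections import Counter

def cell_density(points, cell_size):
    """Divide the plane into square cells and count the points in each cell."""
    return Counter((int(x // cell_size), int(y // cell_size))
                   for x, y in points)

pts = [(0.1, 0.2), (0.4, 0.3), (0.9, 0.8), (2.1, 2.2)]
density = cell_density(pts, 1.0)
# cell (0, 0) contains 3 points, cell (2, 2) contains 1
```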

Euclidean Density – Center-based
• Euclidean density is the number of points within a specified radius of the point.
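A minimal sketch of center-based density (the function name is illustrative; `math.dist` requires Python 3.8+):

```python
import math

def center_based_density(point, points, radius):
    """Number of points within the given radius of the point (inclusive)."""
    return sum(1 for p in points if math.dist(point, p) <= radius)

pts = [(0, 0), (0.5, 0), (0, 0.5), (3, 3)]
n = center_based_density((0, 0), pts, 1.0)  # 3 points within radius 1
```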

Review Questions (Unit 2)
• 1. What are the issues related to data quality? Discuss.
• 2. What is the difference between noise and an outlier?
• 3. Data quality issues can be considered from an application point of view. Discuss.
• 4. Write a short note on Data Preprocessing.
• 5. Discuss the difference between similarity and dissimilarity.
• 6. State the 3 axioms that make a distance a metric.
• 7. Discuss similarity measures for binary variables.
• 8. What is SMC?
• 9. How is cosine similarity used for measuring similarity between 2 document vectors?
• 10. Discuss issues related to computation of proximity measures.
• 11. For the following vectors x, y calculate the indicated similarity or distance measure:
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean.
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard, Hamming distance (for binary data, the L1 distance corresponds to the Hamming distance).

