Lect 09/10-08-09 1
• Similarity: a numerical measure of the degree to which two objects are alike.
• The value of a similarity measure is high when the two objects are very similar.
• If the scale is [0, 1], then 0 indicates no similarity and 1 indicates complete similarity.
• Dissimilarity:
– a numerical measure of the degree to which two objects are different.
– Dissimilarity is lower for a similar pair of objects.
• Transformations:
– Suppose the similarity between two objects ranges from 1 (not at all similar) to 10 (completely similar).
– It can be transformed into the range [0, 1] as follows:
• S' = (S − 1)/9, where S and S' are the original and transformed similarity values.
• This can be generalized as:
– S' = (s − min_s)/(max_s − min_s)
– Similarly, dissimilarity can be mapped onto the interval [0, 1]:
– D' = (d − min_d)/(max_d − min_d).
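The min–max transformation above can be sketched in a few lines of Python (the function name is illustrative, not from the slides):

```python
def minmax(v, v_min, v_max):
    """Map a similarity or dissimilarity value onto [0, 1]."""
    return (v - v_min) / (v_max - v_min)

# Similarity on a 1..10 scale: S' = (S - 1)/9
print(minmax(10, 1, 10))   # 1.0 (completely similar)
print(minmax(1, 1, 10))    # 0.0 (not at all similar)
print(minmax(5.5, 1, 10))  # 0.5
```

The same function maps dissimilarities: D' = (d − min_d)/(max_d − min_d).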
Similarity & Dissimilarity Between Simple Attributes
• Consider objects having a single attribute
– One nominal attribute
• In the context of such an attribute, similarity means whether the two objects have the same value or not:
S = 1 if the attribute values match
S = 0 if the attribute values do not match
Dissimilarity can be defined similarly, in the opposite way.
• One ordinal attribute
– Order is important and should be taken into account.
– Suppose an attribute measures the quality of a product as follows: {poor, fair, OK, good, wonderful}
• Product P1 = wonderful, Product P2 = good, P3 = OK, and so on.
• To make this observation quantitative, map the values to ranks: {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}
• Then d(P1, P2) = 4 − 3 = 1
• Or, normalized, d(P1, P2) = (4 − 3)/4 = 0.25
• A similarity for ordinal attributes can be defined as s = 1 − d
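The match-based similarity for a nominal attribute and the rank-based dissimilarity for an ordinal attribute can be sketched as follows (the quality scale and the P1/P2 values are from the slide; the function names are illustrative):

```python
def nominal_sim(a, b):
    # S = 1 if the attribute values match, S = 0 otherwise
    return 1 if a == b else 0

QUALITY = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def ordinal_dissim(a, b, scale=QUALITY):
    # normalized rank difference: d = |rank(a) - rank(b)| / (n - 1)
    n = len(scale)
    return abs(scale[a] - scale[b]) / (n - 1)

def ordinal_sim(a, b):
    # s = 1 - d
    return 1 - ordinal_dissim(a, b)

print(ordinal_dissim("wonderful", "good"))  # (4 - 3)/4 = 0.25
print(ordinal_sim("wonderful", "good"))     # 0.75
```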
• For interval or ratio attributes, the measure of dissimilarity between two objects is the absolute difference of their attribute values: d = |p − q|.
Similarity/Dissimilarity for Simple
Attributes
p and q are the attribute values for two data objects.
Dissimilarity Between Data Objects
• Distances:
• The Euclidean distance between two data objects (points) x and y with n attributes:

d(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }
Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

[Figure: the four points plotted in the plane]

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Distance Matrix
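A minimal Python sketch that reproduces the distance matrix above (the point coordinates are taken from the table; the function name is illustrative):

```python
from math import sqrt

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    # d(x, y) = sqrt(sum over k of (x_k - y_k)^2)
    return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# print the distance matrix, rounded to 3 decimals as on the slide
for a in points:
    print(a, [round(euclidean(points[a], points[b]), 3) for b in points])
```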
Minkowski Distance

dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r → ∞. "Supremum" (L∞ norm) distance: the maximum difference between any attribute of the two vectors.
Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0

Distance Matrix
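A single Minkowski function covers all three matrices above; this sketch uses the same four points (the special case r = ∞ is handled separately, since the limit is the maximum coordinate difference):

```python
def minkowski(p, q, r):
    """Minkowski distance: (sum over k of |p_k - q_k|^r)^(1/r)."""
    if r == float("inf"):
        # L-infinity norm: maximum difference over all attributes
        return max(abs(pk - qk) for pk, qk in zip(p, q))
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))             # L1:  6.0
print(round(minkowski(p1, p4, 2), 3))   # L2:  5.099
print(minkowski(p1, p4, float("inf")))  # L∞:  5
```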
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
1. Positivity: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
2. Symmetry: d(p, q) = d(q, p) for all p and q.
3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
• A distance that satisfies all three properties is called a metric.
Example of a Non-Metric Dissimilarity
• Two sets A = {1, 2, 3, 4}, B = {2, 3, 4}
• A − B = {1}, B − A = { }
• Define the distance d between A and B as
d(A, B) = size(A − B),
where size is a function returning the number of elements in a set.
• This distance measure does not satisfy the positivity property, the symmetry property, or the triangle inequality: for example, d(B, A) = 0 even though A ≠ B, and d(A, B) = 1 ≠ 0 = d(B, A).
• These properties can hold if the dissimilarity measure is modified as follows:
d(A, B) = size(A − B) + size(B − A).
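The set-difference example can be checked directly in Python; `d_asym` is the non-metric measure from the slide and `d_sym` is the symmetric fix (both names are illustrative):

```python
A = {1, 2, 3, 4}
B = {2, 3, 4}

def d_asym(X, Y):
    # size(X - Y): not symmetric, and can be 0 for distinct sets
    return len(X - Y)

def d_sym(X, Y):
    # size(X - Y) + size(Y - X): the modified, symmetric measure
    return len(X - Y) + len(Y - X)

print(d_asym(A, B), d_asym(B, A))  # 1 0  -> symmetry and positivity fail
print(d_sym(A, B), d_sym(B, A))    # 1 1
```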
Common Properties of a Similarity
• Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q (symmetry).
Similarity Between Binary Vectors
• A common situation is that objects p and q have only n binary attributes.
• p and q are then binary vectors, which leads to the following four quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Similarities are computed from these quantities.
SMC versus Jaccard: Example
SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)
Ex. p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
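The worked example above can be reproduced with a short sketch (the vectors p and q are the ones on the slide; the helper names are illustrative):

```python
def binary_counts(p, q):
    """Count M01, M10, M00, M11 for two binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return m01, m10, m00, m11

def smc(p, q):
    m01, m10, m00, m11 = binary_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)

def jaccard(p, q):
    m01, m10, m00, m11 = binary_counts(p, q)
    return m11 / (m01 + m10 + m11)

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(p, q))      # 0.7
print(jaccard(p, q))  # 0.0
```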
Cosine Similarity
• This similarity measure is often used for documents.
• Documents are often represented as vectors, where each attribute represents the frequency of a particular word in the document.
• Each document may have thousands or tens of thousands of attributes.
• Cosine similarity is one of the most common measures of document similarity.
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
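A minimal sketch of the formula; the two term-frequency vectors below are hypothetical examples, not from the slides:

```python
from math import sqrt

def cosine(d1, d2):
    # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = sqrt(sum(a * a for a in d1))
    n2 = sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)   # hypothetical word counts
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 3))        # ≈ 0.315
print(cosine((1, 2, 3), (2, 4, 6)))    # parallel vectors -> 1.0
```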
• If the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude; if it is 0, the angle is 90° and the vectors share no common attributes.
[Figure: the angle θ between vectors x and y]
Extended Jaccard Coefficient (Tanimoto)
• Can be used for document data.
• Reduces to the Jaccard coefficient for binary attributes:
EJ(x, y) = (x • y) / (||x||² + ||y||² − x • y)
Correlation
• Correlation measures the linear relationship between objects.
• Example (perfect correlation): the following two pairs of vectors x and y illustrate cases where the correlation is −1 and +1.
• x = (−3, 6, 0, 3, −6) and y = (1, −2, 0, −1, 2) [COMPUTE]
• x = (3, 6, 0, 3, 6) and y = (1, 2, 0, 1, 2) [COMPUTE]
• To compute correlation, we standardize the data objects p and q and then take their dot product:

p'_k = (p_k − mean(p)) / std(p)
q'_k = (q_k − mean(q)) / std(q)
correlation(p, q) = p' • q'
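A sketch that answers the two [COMPUTE] exercises above. Note one detail the slide glosses over: when the sample standard deviation (n − 1 denominator) is used, the dot product of the standardized vectors must be scaled by 1/(n − 1) to land in [−1, 1]:

```python
from math import sqrt

def correlation(p, q):
    """Pearson correlation via standardization and dot product."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = sqrt(sum((v - mp) ** 2 for v in p) / (n - 1))  # sample std
    sq = sqrt(sum((v - mq) ** 2 for v in q) / (n - 1))
    p_std = [(v - mp) / sp for v in p]
    q_std = [(v - mq) / sq for v in q]
    # dot product of standardized vectors, scaled by 1/(n - 1)
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)

x, y = (-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)
print(round(correlation(x, y), 6))   # -1.0, since y = -x/3
x, y = (3, 6, 0, 3, 6), (1, 2, 0, 1, 2)
print(round(correlation(x, y), 6))   # 1.0, since y = x/3
```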
Visually Evaluating Correlation
• Scatter plots showing similarity from −1 to 1.
• It is easy to visualize the correlation between two data objects x and y by plotting pairs of corresponding attribute values.
• [Figure: a number of such plots where x and y have 30 attributes; each circle in a plot represents one of the 30 attributes, its x coordinate is the value of that attribute of x, while its y coordinate is the value of the same attribute of y.]
Issues in Computing Proximity Measures
• 1. Handling the cases where attributes have different scales and/or are correlated.
• 2. Computing proximity between objects when they have different types of attributes (qualitative/quantitative).
• 3. Handling attributes with different weights, i.e., when not all attributes contribute equally to the proximity of objects.
Mahalanobis Distance
• The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution is approximately Gaussian:
mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ,
where Σ is the covariance matrix of the data.
• Example:
Σ = [ 0.3  0.2
      0.2  0.3 ]
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
[Figure: points A, B, C plotted against the elongated Gaussian distribution]
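The slide's values 5 and 4 correspond to the squared form (p − q) Σ⁻¹ (p − q)ᵀ, so this sketch computes that quantity directly for 2-D points, inverting the 2×2 covariance matrix by hand (function names are illustrative):

```python
def mahalanobis_sq(p, q, sigma):
    """Squared Mahalanobis distance (p-q) Sigma^{-1} (p-q)^T for 2-D points."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # closed-form inverse of a 2x2 matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + \
           dy * (inv[1][0] * dx + inv[1][1] * dy)

sigma = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahalanobis_sq(A, B, sigma), 6))  # ≈ 5.0, as on the slide
print(round(mahalanobis_sq(A, C, sigma), 6))  # ≈ 4.0
```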
Combining Similarities for
heterogeneous attributes
• Sometimes attributes are of many different types,
but an overall similarity is needed.
Using Weights to Combine Similarities
• Some attributes may be more important than others for calculating proximity; we may not want to treat all attributes the same.
– Use weights w_k that are between 0 and 1 and sum to 1; each per-attribute similarity is then multiplied by its weight w_k before combining.
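A minimal sketch of the weighted combination, assuming the per-attribute similarities s_k have already been computed; the similarity and weight values below are made up for illustration:

```python
def weighted_similarity(sims, weights):
    """Combine per-attribute similarities s_k with weights w_k, sum(w_k) = 1."""
    assert abs(sum(weights) - 1) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, sims))

# three attributes: one exact match, one partial, one mismatch
sims = [1.0, 0.5, 0.0]
weights = [0.5, 0.3, 0.2]
print(weighted_similarity(sims, weights))  # 0.5*1 + 0.3*0.5 + 0.2*0 = 0.65
```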
Density
• Density-based clustering requires a notion of density.
• Examples:
– Euclidean density
• Euclidean density = number of points per unit volume
– Probability density
– Graph-based density
Euclidean Density – Cell-Based
• The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains.
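The cell-based count can be sketched by bucketing each point into a grid cell; the points and cell size below are made up for illustration:

```python
from collections import Counter

def cell_density(points, cell_size):
    """Count points per square grid cell of side `cell_size`."""
    return Counter((int(x // cell_size), int(y // cell_size))
                   for x, y in points)

pts = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (2.5, 2.5)]
print(cell_density(pts, 1.0))  # cell (0,0) holds 2 points, (1,0) and (2,2) one each
```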
Euclidean Density – Center-Based
• Euclidean density is the number of points within a specified radius of the point.
Review Questions (Unit 2)
• 1. What are the issues related to data quality? Discuss.
• 2. What is the difference between noise and an outlier?
• 3. Data quality issues can be considered from an application point of view. Discuss.
• 4. Write a short note on data preprocessing.
• 5. Discuss the difference between similarity and dissimilarity.
• 6. State the 3 axioms that make a distance a metric.
• 7. Discuss similarity measures for binary variables.
• 8. What is SMC?
• 9. How is cosine similarity used for measuring the similarity between 2 document vectors?
• 10. Discuss issues related to the computation of proximity measures.
• 11. For the following vectors x, y, calculate the indicated similarity or distance measures:
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean.
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard, Hamming distance (for binary data, the L1 distance corresponds to the Hamming distance).