Beruflich Dokumente
Kultur Dokumente
(values mapped to
integers 0 to n-1 where
n is the number of
values)
Distances:
Distances are dissimilarities with certain properties.
The Euclidian distance, d, between two points , x
and y in one , two or higher dimensional space is
given by the formula:
d(x, y) = √∑𝑛𝑘=1(𝑥𝑘 − 𝑦𝑘 )2
where n is the number of dimensions and xk and yk
are, respectively, the kth attribute (component) of x
and y.
The Euclidian distance measure is given generalized
by the Minkowski distance metric shown as:
d(x,y) = (∑𝑛𝑘=1 |𝑥𝑘 − 𝑦𝑘 |𝑟 )1/𝑟
where r is a parameter.
The following are the 3 most common examples of
Minkowski distances:
r = 1 also known as City block (Manhattan or L1
norm) distance. A common example is the
Hamming distance, which is the number of
bits that are different between two objects that
only have binary attributes (i.e., binary vectors)
r=2. Euclidian distance (L2 norm).
r= . Supremum, (Lmax or L norm) distance.
This is the maximum difference between any
attributes of the objects. The L is defined more
formally by:
d(x, y)= lim𝑟→∞ (∑𝑛𝑘=1 |𝑥𝑘 − 𝑦𝑘 |𝑟 )1/𝑟
L2 distance:
p1 p2 p3
p1 0 √𝟓 √𝟓
p2 √𝟓 0 √𝟐
p3 √𝟓 √𝟐 0
L1 distance:
p1 p2 p3
p1 0 3
p2 3 0 2
p3 3 2 0
Cosine Similarity:
Documents are often represented as vectors, in
which each attribute represents the frequency with
which a particular term (word) occurs in the
document. Even though documents may have
thousands or tens of thousands of words, each
document is sparse since it has few non zero
attributes. Therefore, a similarity measure for
documents needs to ignore 0-0 matches like he
Jaccard measure, but must handle non-binary
vectors. The cosine similarity is the most common
measure of document similarity.
if x and y are 2 documents, then:
𝑥.𝑦
cos(x,y)=
||𝑥||||𝑦||′
Correlations:
The correlation between two data objects that have
binary or continuous variables is a measure of the
linear relationship between the attributes of the
objects. More precisely, Pearson’s correlation
coefficient between two data objects , x and y, is
defined by the following equation:
𝑐𝑜𝑣𝑎𝑟𝑖𝑎𝑛𝑐𝑒(𝑥,𝑦) 𝑠𝑥𝑦
corr(x,y)= =
𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛(𝑥)∗𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛(𝑦) 𝑠𝑥 𝑠𝑦
1
standard deviation(x) = sx= √ ∑𝑛𝑘=1(𝑥𝑘 − 𝑥̅ )2
𝑛−1
1
standard deviation(y) = sy= √ ∑𝑛𝑘=1(𝑦𝑘 − 𝑦̅)2
𝑛−1
1
𝑥̅ = ∑𝑛𝑘=1 𝑥𝑘 is the mean of x (average)
𝑛
1
𝑦̅ = ∑𝑛𝑘=1 𝑦𝑘 is the mean of y (average)
𝑛
Perfect correlation:
Correlation is always in the range of -1 to 1. A
correlation of 1 (-1) means that x and y have a
perfect positive (negative) linear relationship; that is
xk= ayk+b