
GEMINI

GEneric Multimedia INdexIng

 GEneric Multimedia INdexIng
 distance measure
 Sub-pattern Match
 ‘quick and dirty’ test
 Lower bounding lemma
 1-D Time Sequences
 Color histograms
 Color auto-correlogram
 Shapes
GEneric Multimedia INdexIng
 Given a database of multimedia objects
 Design fast search algorithms that locate objects that
match a query object, exactly or approximately
 Objects:
• 1-d time sequences
• Digitized voice or music
• 2-d color images
• 2-d or 3-d gray scale medical images
• Video clips

 E.g.: “Find companies whose stock prices move similarly”

Applications
 time series:
• financial, marketing (click-streams!), ECGs,
sound;
 images:
• medicine, digital libraries, education, art
 higher-d signals:
• scientific databases (e.g., astrophysics), medicine (MRI scans), entertainment (video)

Sample queries

 Find medical cases similar to Smith's


 Find pairs of stocks that move in sync
 Find pairs of documents that are similar
(plagiarism?)
 Find faces similar to ‘Tiger Woods’

[Figure: three yearly stock-price curves ($price vs. day, 1–365); the distance function is chosen by an expert (e.g., Euclidean distance)]
Generic Multimedia Indexing

 1st step: provide a measure for the distance between two objects
 Distance function d():
• Given two objects O1, O2, the distance (= dis-similarity) of the two objects is denoted by d(O1, O2)
• E.g., the Euclidean distance (square root of the sum of squared differences) of two equal-length time series
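As a sketch (helper name hypothetical, not from the slides), the Euclidean distance between two equal-length sequences:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences:
    the square root of the sum of squared differences."""
    assert len(x) == len(y), "sequences must have equal length"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # -> 5.0
```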

ε-Similarity query
 Given a query object Q, find all objects Oi in the database that are ε-similar (identical for ε = 0) to Q:

 {Oi ∈ DB | d(Q, Oi) ≤ ε}
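As a minimal sketch (names hypothetical), the ε-similarity query is a filter over the whole database:

```python
def epsilon_similar(db, query, eps, dist):
    """Return all objects within distance eps of the query;
    with eps = 0 only exact matches (distance 0) qualify."""
    return [obj for obj in db if dist(query, obj) <= eps]

# Usage with scalar objects and absolute difference as the distance:
hits = epsilon_similar([1.0, 2.0, 5.0], query=1.5, eps=1.0,
                       dist=lambda a, b: abs(a - b))
print(hits)  # -> [1.0, 2.0]
```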

Types of Similarity Queries

 Whole match queries:
• Given a collection of S objects O1, …, OS and a query object Q, find the data objects that are within distance ε from Q

Types of Similarity Queries

 Sub-pattern Match:
• Given a collection of S objects O1, …, OS, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q

Ideal method – requirements

 Fast: sequential scanning, with a distance calculation against each and every object, is too slow for large databases

 Dynamic: easy to insert, delete, and update objects

Basic idea
 Focus on ‘whole match’ queries
• Given a collection of S objects O1,…, Os, a
distance/dis-similarity function d(Oi, Oj), and a
query object Q find data objects that are within
distance ε from Q
 Sequential scanning? May be too slow, for the following reasons:
• Distance computation is expensive (e.g., edit distance on DNA strings)
• The database size S may be huge
 Faster alternative?

GEneric Multimedia INdexIng
(Christos Faloutsos; QBIC, 1994)

• A feature extraction function maps the high-dimensional objects into a low-dimensional space
• Objects that are very dissimilar in the feature space are also very dissimilar in the original space

Basic idea
 Faster alternative:
 Step 1: a ‘quick and dirty’ test to discard the vast majority of non-qualifying objects
 Step 2: use of Spatial Access Methods (SAMs: R-trees, Hilbert curves, …) to achieve faster-than-sequential searching
 Example:
 Database of yearly stock price movements
• Euclidean distance function
• Characterize with a single number (‘feature’)
• Or use two or more features

Basic idea - illustration
[Figure: each sequence Si (day 1–365) is mapped to a point F(Si) in a 2-d feature space (Feature1, Feature2)]

 A query with tolerance ε becomes a sphere with radius ε

Basic idea – caution!
 The mapping F() from objects to k-dim. points should not distort the distances
 d(): distance of two objects
 dfeature(): distance of their corresponding feature vectors
 Ideally, perfect preservation of distances
 In practice, a guarantee of no false dismissals
 How?

 Objects represented by vectors that are very dissimilar in the feature space are expected to be very dissimilar in the original space

 If the distances in the feature space are always smaller than or equal to the distances in the original space, a bound that is valid in both spaces can be determined

 The distance of similar objects is smaller than or equal to ε in the original space and, consequently, it is smaller than or equal to ε in the feature space as well...

Lower bounding lemma

 if the distance of similar “objects” is smaller than or equal to ε in the original space
 then it is also smaller than or equal to ε in the feature space

dfeature(F(O1), F(O2)) ≤ d(O1, O2)

o.k.:    d(O1, O2) ≤ ε  ⟹  dfeature(F(O1), F(O2)) ≤ ε
WRONG:   dfeature(F(O1), F(O2)) ≤ ε  ⟹  d(O1, O2) ≤ ε ?

 No object will be missed in the feature space (no false dismissals)
 There will be some objects that are not similar in the original space (false hits/alarms)

 That means that we are guaranteed to have selected all the objects we wanted, plus some additional false hits, in the feature space

 In the second step, the false hits have to be filtered out of the set of selected objects through comparison in the original space
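The two steps can be sketched as a filter-and-refine loop. This is illustrative (names hypothetical); the single feature here is the sequence average, whose difference lower-bounds the Euclidean distance, since |mean(x) − mean(y)| ≤ d(x, y)/√n ≤ d(x, y):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(x):
    return sum(x) / len(x)

def gemini_search(db, query, eps, dist=euclidean, feature=mean):
    """Filter in feature space (no false dismissals, because the
    feature distance lower-bounds dist), then discard the false
    hits by re-checking the true distance."""
    fq = feature(query)
    candidates = [o for o in db if abs(feature(o) - fq) <= eps]  # quick & dirty
    return [o for o in candidates if dist(query, o) <= eps]      # cleanup step

db = [[1, 1, 1], [1, 2, 3], [10, 10, 10]]
print(gemini_search(db, [1, 1, 2], 2.0))  # -> [[1, 1, 1], [1, 2, 3]]
```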

Time sequences
[Figure: white noise vs. brown noise, with their Fourier spectra shown in log-log scale]

Time sequences
 Conclusion: colored noises are well
approximated by their first few Fourier
coefficients

 Colored noises appear in nature

Time sequences
 E.g.:
[Figure: an example time sequence]

GEMINI
Important:
Q: how to extract features?
A: “if I have only one number to describe my object, what should this be?”

1-D Time Sequences
 Distance function: Euclidean distance
 Find features that:
 Preserve/lower-bound the distance
 Carry as much information as possible (reduce false alarms)
 If we are allowed to use only one feature what
would this be? The average
 … extending it…

1-D Time Sequences
 If we are allowed to use only one feature what would this be? The average
 … extending it…
 The average of the 1st half, of the 2nd half, of the 1st quarter, etc.
 Coefficients of the Fourier transform (DFT), wavelet transform, etc.

Feature extracting function
1. Define a distance function
2. Find a feature extraction function F() that satisfies the lower-bounding lemma

Example:
 The Discrete Fourier Transform (DFT) preserves Euclidean distances between signals (Parseval's theorem)
 F() = DFT, keeping the first few coefficients of the transform

1-D Time Sequences

Show that the distance in feature space lower-bounds the actual distance.
DFT?
Parseval's Theorem: the DFT preserves the energy of the signal as well as the distances between two signals:
d(x, y) = d(X, Y)
where X and Y are the Fourier transforms of x and y.

If we keep the first k ≤ n coefficients of the DFT, we lower-bound the actual distance:

dfeature(F(x), F(y))² = Σf=0..k−1 |Xf − Yf|²  ≤  Σf=0..n−1 |Xf − Yf|²  =  Σi=0..n−1 |xi − yi|²  =  d(x, y)²
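A numerical check of this chain with NumPy (a sketch; `norm="ortho"` scales the FFT by 1/√n so Parseval's theorem holds as an exact distance preservation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = rng.standard_normal(64)

# Orthonormal DFT: Parseval's theorem gives d(x, y) = d(X, Y).
X = np.fft.fft(x, norm="ortho")
Y = np.fft.fft(y, norm="ortho")
d_true = np.linalg.norm(x - y)
assert np.isclose(np.linalg.norm(X - Y), d_true)

# Keeping only the first k coefficients can only drop non-negative
# terms from the sum of squares, so the feature distance lower-bounds d.
for k in (1, 2, 3, 8):
    assert np.linalg.norm(X[:k] - Y[:k]) <= d_true
```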

Time sequences - results
 keep the first 2-3 Fourier coefficients
 faster than seq. scan
 no false dismissals
[Figure: total time = cleanup time + R-tree time, plotted against the number of coefficients kept]

Time sequences - improvements:
 could use Wavelets, or DCT
 could use segment averages

Images - color

what is an image?
A: 2-d array

2-D color images – Color histograms

 Each color image – a 2-d array of pixels
 Each pixel – 3 color components (R, G, B)
 h colors – each color denoting a point in 3-d color space (as many as 2^24 colors)
 For each image compute the h-element color histogram – each component is the percentage of pixels that are most similar to that color
 The histogram of image I is defined as:
For a color Ci, HCi(I) represents the number of pixels of color Ci in image I
OR:
For any pixel in image I, HCi(I) represents the probability of that pixel having color Ci.

17
2-D color images – Color histograms
 Usually cluster similar colors together and choose one
representative color for each ‘color bin’
 Most commercial CBIR systems include color histogram as
one of the features (e.g., QBIC of IBM)
 No space information

Color histograms - distance
 One method to measure the distance between two h-element histograms x and y is:

d²h(x, y) = (x − y)ᵗ · A · (x − y) = Σi Σj aij (xi − yi)(xj − yj)

 where the color-to-color similarity matrix A has entries aij that describe the similarity between color i and color j
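A minimal sketch of this quadratic form in NumPy (function name hypothetical):

```python
import numpy as np

def hist_distance_sq(x, y, A):
    """Squared histogram distance d_h^2(x, y) = (x - y)^t A (x - y),
    where A[i][j] encodes the similarity of colors i and j."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)

# With A = I the distance reduces to the squared Euclidean distance:
print(hist_distance_sq([1, 0, 0], [0, 1, 0], np.eye(3)))  # -> 2.0
```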


Color histograms – lower bounding
 1st step: define the distance function between two color images: d() = dh()
 2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds dh()
 If we are allowed to use one numerical feature to describe the color image, what should it be?
 The average amount of each color component (R, G, B):
x̄ = (Ravg, Gavg, Bavg)ᵗ
 where Ravg = (1/P) Σp=1..P R(p), and similarly for G and B;
P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel
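A sketch of this feature extraction (helper name hypothetical; pixels as (R, G, B) tuples):

```python
def avg_color(pixels):
    """Average-color feature x = (Ravg, Gavg, Bavg):
    each component is the mean intensity over all P pixels."""
    P = len(pixels)
    return tuple(sum(p[c] for p in pixels) / P for c in range(3))

print(avg_color([(255, 0, 0), (0, 0, 0)]))  # -> (127.5, 0.0, 0.0)
```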

Color histograms – lower bounding
 Given the average color vectors x̄ and ȳ of two images, we define davg() as the Euclidean distance between the 3-d average color vectors:

d²avg(x̄, ȳ) = (x̄ − ȳ)ᵗ · (x̄ − ȳ) = Σi=1..3 (x̄i − ȳi)²

 3rd step: prove that the feature distance davg() lower-bounds the actual distance dh()...
• ...by the ``Quadratic Distance Bounding'' theorem, the distance between the full color histograms is guaranteed to be greater than or equal to the distance between the average color vectors. The proof of the theorem is based on an unconstrained minimization problem using Lagrange multipliers
 Main idea of the approach:
 first a filtering using the average (R, G, B) color,
 then a more accurate matching using the full h-element histogram

Images - color
[Figure: performance — time vs. selectivity; filtering with the avg RGB feature is much faster than sequential scan]
Color auto-correlogram
 pick any pixel p1 of color Ci in the image I
 at distance k away from p1 pick another pixel p2
 what is the probability that p2 is also of color Ci?
[Figure: image I with pixels P1 and P2 at distance k — is P2 also red?]

Color auto-correlogram
 The auto-correlogram of image I for color Ci, distance k:

γ(k)Ci(I) ≜ Pr[ |p1 − p2| = k, p2 ∈ ICi | p1 ∈ ICi ]

 Integrates both color information and space information
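A brute-force sketch of this definition (helper name hypothetical; it scans all pixel pairs, so it is O(n⁴) and only illustrative — practical implementations walk only the ring of pixels at distance k). The chessboard distance max(|Δx|, |Δy|) is used:

```python
from collections import defaultdict

def auto_correlogram(img, k):
    """For each color c: the probability that a pixel at chessboard
    distance exactly k from a pixel of color c also has color c."""
    h, w = len(img), len(img[0])
    same = defaultdict(int)   # same-color pairs, per color
    total = defaultdict(int)  # all pairs at distance k, per color
    for y1 in range(h):
        for x1 in range(w):
            c = img[y1][x1]
            for y2 in range(h):
                for x2 in range(w):
                    if max(abs(x1 - x2), abs(y1 - y2)) == k:
                        total[c] += 1
                        same[c] += (img[y2][x2] == c)
    return {c: same[c] / total[c] for c in total}

# A uniform image is perfectly auto-correlated at every distance:
print(auto_correlogram([[0, 0], [0, 0]], 1))  # -> {0: 1.0}
```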

Color auto-correlogram

Implementations
 Pixel Distance Measures
 Use the D8 distance (also called chessboard distance):

dmax(p, q) = max(|px − qx|, |py − qy|)

 Choose distances k = 1, 3, 5, 7
 Computation complexity:
• Histogram: O(n²)
• Correlogram: O(134 · n²)
Implementations
 Feature Distance Measures:
 D( f(I1) − f(I2) ) is small ⟹ I1 and I2 are similar
 m = R, G, B; k = distance
 For histograms:

|I − I′|h ≜ Σi∈[m] |hCi(I) − hCi(I′)| / (1 + hCi(I) + hCi(I′))

 For correlograms:

|I − I′|γ ≜ Σi∈[m], k∈[d] |γ(k)Ci(I) − γ(k)Ci(I′)| / (1 + γ(k)Ci(I) + γ(k)Ci(I′))

Color Histogram vs Correlogram
[Figure: top-5 retrieval results for a query image (512 colors), correlogram method vs. histogram method]
Color Histogram vs Correlogram
[Figure: query/target pair — the correlogram method ranks the target 1st, the histogram method 48th]

Color Histogram vs Correlogram
[Figure: query/target pair — the correlogram method ranks the target 1st, the histogram method 31st]
Color Histogram vs Correlogram

Rank of the target image:
                  Query 1   Query 2   Query 3   Query 4
Correlogram (C):  178th     1st       1st       5th
Histogram (H):    230th     1st       3rd       18th

 The correlogram method is more stable to contrast & brightness changes than the histogram method.

Color Histogram vs Correlogram

 The color correlogram describes the global distribution of local spatial correlations of colors
 It’s easy to compute
 It’s more stable than the color histogram method

Images - shapes
 Distance function: Euclidean, on the area

 Q: how to do dim. reduction?

 A: Karhunen-Loeve (PCA)

Images - shapes
 Performance: ~10x faster
[Figure: log(# of I/Os) vs. number of features kept, compared with keeping all features]
Multimedia Indexing – Conclusions

 GEMINI is a popular method
 Whole matching problem
 Should pay attention to:
• Distance functions
• Feature Extraction functions
• Lower Bounding
• Particular application

Conclusions
 GEMINI works for any setting (time
sequences, images, etc)
 uses a ‘quick and dirty’ filter

 faster than seq. scan

 GEneric Multimedia INdexIng
 distance measure
 Sub-pattern Match
 ‘quick and dirty’ test
 Lower bounding lemma
 1-D Time Sequences
 Color histograms
 Color auto-correlogram
 Shapes

