
Robust Audio Fingerprint Extraction Algorithm Based on 2-D Chroma

Hongxue Wang, Xiaoqing Yu*, Wanggen Wan, Ram Swaminathan


School of Communication and Information Engineering
Institute of Smart City, Shanghai University
Shanghai HanPan Information S&T Ltd.
Shanghai 200072, China

*Corresponding Author: Xiaoqing Yu, Shanghai University. Email Address: yxq@staff.shu.edu.cn

Abstract

Audio fingerprinting, like a human fingerprint, identifies audio clips in a large database successfully, even when the audio signals are slightly or seriously distorted. In this paper we propose an improved, 2-D-based version of the audio fingerprint extraction algorithm originally proposed by the Shazam company. The algorithm uses a combinatorially hashed time-frequency analysis of the audio, yielding the unusual property that multiple tracks mixed together may each be identified. The experimental results verify the improvement in retrieval speed and accuracy.

1. Introduction

Fingerprint systems are over one hundred years old. With the development of digital media technology, people are routinely exposed to music in everyday environments but are frustrated by not being able to learn more about what they hear. They may, for instance, be interested in a particular piece of music and want to know its title and the name of the artist who created it. These problems promoted the development of query-by-example (QBE) music search services that enable users to learn the identity of audible prerecorded music by sampling a few seconds of audio with a recording device [1].

An audio fingerprinting system has the following characteristics [2]:

a) Robustness: audio signals that suffer from serious noise or frequency-domain distortion can still be accurately identified. To a certain extent, the audio fingerprint captures properties that are invariant to signal processing in order to achieve this robustness.

b) Granularity: the length of the query. Obviously, the shorter the granularity, the better the system performance, but a granularity that is too short will impact the retrieval accuracy.

c) Retrieval speed: with the rapid growth of the audio database, retrieving an audio query can take a long time. Commercial audio retrieval systems in particular require a relatively high retrieval speed, so the fingerprint and the database should be designed to reduce the retrieval time and thereby meet practical requirements.

These parameters are interrelated; changing one parameter will affect the performance of another. Therefore, the parameter values should be set according to the actual application.

An audio fingerprinting system consists of two main parts: the audio fingerprint extraction algorithm and the audio search algorithm. An audio fingerprint that represents the characteristics of the original audio is extracted from the audio signal, and a perceptual hash is then computed to produce a highly robust audio fingerprint.

In this paper, Section 2 reviews previous work on audio fingerprinting. Section 3 introduces our proposed audio fingerprinting algorithm, and Section 4 describes the audio search algorithm. Preliminary experimental results are presented in Section 5. Finally, conclusions and suggestions for future work are collected in Section 6.

2. Review of previous audio fingerprint algorithms

Up to now, many researchers have worked on audio fingerprinting and have developed algorithms, some of which have been commercialized. In 2004, Gracenote Inc. and the Royal Philips Research institute developed the music recognition software "Gracenote Mobile" [3], which can be used on a mobile phone.

The well-known Philips AF algorithm [4] proposed by Haitsma and Kalker has been widely applied. The audio signal, sampled at 44100 Hz, is segmented into Hanning-windowed overlapping frames with an overlap factor of 31/32. After an FFT of each frame, the audio spectrum is divided into 33 frequency bands, which are used to obtain 32 hash bits. The internal energy-difference structure of the audio signal is relatively invariant to many types of noise and distortion.
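As an illustration only, the following sketch shows how energy-difference hash bits of this kind can be derived from the band energies of consecutive frames; the band layout, frame parameters and function names are our own assumptions, not the exact Philips implementation.

import numpy as np

def subfingerprint_bits(band_energy: np.ndarray) -> np.ndarray:
    """Derive 32 hash bits per frame from 33 band energies.

    band_energy: array of shape (num_frames, 33); band_energy[n, m] is the
    energy of band m in frame n (how the 33 bands are laid out is an
    assumption here, e.g. log-spaced bands in the low-frequency range).
    Returns a (num_frames - 1, 32) array of 0/1 bits: a bit is 1 when the
    difference between adjacent bands increases from one frame to the next.
    """
    # Difference between adjacent bands within each frame: shape (F, 32)
    band_diff = band_energy[:, :-1] - band_energy[:, 1:]
    # Difference of that quantity between consecutive frames: shape (F-1, 32)
    frame_diff = band_diff[1:, :] - band_diff[:-1, :]
    return (frame_diff > 0).astype(np.uint8)

# Hypothetical usage with random "band energies" just to show the shapes.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    energies = rng.random((10, 33))          # 10 frames, 33 bands
    bits = subfingerprint_bits(energies)     # -> shape (9, 32)
    print(bits.shape)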
Another important algorithm was proposed by Shazam Entertainment, Ltd. [5] in the UK. The audio signal is transformed into a 2-D image [6] by a series of transformations, and the maximum-energy peak points in the time-frequency domain are located; the audio fingerprint is then generated by combining pairs of these features. The algorithm does not need to retain the global information of the entire spectrum; it only extracts the significant spectral components, so it offers good anti-noise performance and a highly robust fingerprint.

3. Proposed audio fingerprinting algorithm


An audio retrieval system contains two parts: fingerprint extraction and the fingerprint matching algorithm. Audio fingerprint technology is now quite mature, but extracting fingerprints that remain highly robust under strong noise is still a concern. The proposed algorithm converts the original signal into a 2-D image for analysis, uses local maximum chroma energy (LMCE) points as feature points, and then generates a highly robust audio fingerprint through fingerprint modeling; this is demonstrated in detail in the following subsections. The extracted fingerprint is indexed with a hash algorithm to build an index table, the matching algorithm then finds the most similar reference fingerprints, and finally the system outputs the corresponding audio metadata.

It is well known that the Human Auditory System (HAS) [7] is sensitive to the magnitude of the spectrum, so the frequency energy FE defined in Equation (1) is used to measure the perceptual similarity of two audio signals:

FE = \log \int_0^{w_0} |X(w)|^2 \, dw    (1)

where X(w) are the Fourier coefficients and w_0 is half the sampling rate.

The flowchart of the audio fingerprint extraction is shown in Fig. 1.

Figure 1. Flowchart of the proposed extraction scheme
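As a minimal illustration of Equation (1), the following sketch computes FE for one frame using an FFT; the frame length and variable names are our own assumptions.

import numpy as np

def frequency_energy(frame: np.ndarray) -> float:
    """Compute FE = log( integral of |X(w)|^2 over [0, w0] ), Equation (1).

    The integral is approximated by a sum over the FFT bins up to the
    Nyquist frequency (w0 = half the sampling rate).
    """
    spectrum = np.fft.rfft(frame)            # Fourier coefficients up to Nyquist
    energy = np.sum(np.abs(spectrum) ** 2)   # discrete approximation of the integral
    return float(np.log(energy + 1e-12))     # small epsilon avoids log(0) for silence

# Hypothetical usage on a 64 ms frame of a 1 kHz tone sampled at 8 kHz.
if __name__ == "__main__":
    fs = 8000
    t = np.arange(int(0.064 * fs)) / fs
    frame = np.sin(2 * np.pi * 1000 * t)
    print(frequency_energy(frame))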

3.1. Frequency energy


The first stage of the algorithm is to segment the audio signal into overlapping frames; after the FFT and a high-pass filter, the chroma energy is calculated. Figure 2 shows the waveform of the original audio (a) and of the signal with a signal-to-noise ratio (SNR) of 5 dB (b). It can be seen that after adding noise the signal waveform is no longer smooth, so the original signal should be processed before extracting audio features in order to smooth it and extract a highly robust fingerprint. The paper therefore down-samples the signal to an 8 kHz sampling rate, applies a window function of 64 ms length, and frames the signal with 60% overlap, as defined in Equation (2):

S'(k, j) = s(k \cdot L_s + j), \quad k = 0, 1, 2, \ldots, N, \; 1 \le j \le 512    (2)

where k is the frame index and L_s is the frame shift.

Figure 2. Waveform of audio signal (original and SNR=5)
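A minimal sketch of this framing step, assuming the signal has already been resampled to 8 kHz, with a 512-sample (64 ms) window and 60% overlap; the Hamming window choice and helper names are our own assumptions, since the paper only says "window function".

import numpy as np

def frame_signal(s: np.ndarray, frame_len: int = 512, overlap: float = 0.6) -> np.ndarray:
    """Split a signal into overlapping, windowed frames (Equation (2)).

    At an 8 kHz sampling rate, frame_len=512 corresponds to 64 ms; with 60%
    overlap the frame shift L_s is 0.4 * 512 = 204 samples (rounded down).
    """
    hop = int(frame_len * (1.0 - overlap))        # frame shift L_s
    window = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    frames = np.empty((n_frames, frame_len))
    for k in range(n_frames):
        frames[k] = s[k * hop : k * hop + frame_len] * window   # S'(k, j)
    return frames

# Hypothetical usage: frame one second of 8 kHz noise and print (num_frames, 512).
if __name__ == "__main__":
    s = np.random.default_rng(0).standard_normal(8000)
    print(frame_signal(s).shape)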
Figure 3 shows the tempo-frequency images derived from the signals in Figure 2. From the figure we can see that the chroma maxima have higher energy, which is hardly changed by noise and other forms of distortion. The FFT is used to convert the signal to the frequency domain, S(k, j) = FFT(s(k, j)); local energy aggregation properties are then obtained after weighted processing, as given in Equation (3):

\bar{S}(i, j) = \sum_{m=i-M}^{i+M} \sum_{n=j-N}^{j+N} C(m, n) \cdot S(m, n)    (3)

where C(m, n) is a 2-D Gaussian window function and M and N define the size of the neighborhood; the paper takes M = N = 2.
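As an illustrative sketch of Equation (3), the following smooths a magnitude spectrogram with a 5x5 (M = N = 2) 2-D Gaussian window; the Gaussian width sigma is our own assumption, since the paper does not specify it.

import numpy as np

def gaussian_window_2d(M: int = 2, N: int = 2, sigma: float = 1.0) -> np.ndarray:
    """Build the (2M+1) x (2N+1) 2-D Gaussian weighting C(m, n)."""
    m = np.arange(-M, M + 1)[:, None]
    n = np.arange(-N, N + 1)[None, :]
    C = np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma ** 2))
    return C / C.sum()

def local_energy_aggregation(S: np.ndarray, M: int = 2, N: int = 2) -> np.ndarray:
    """Weighted neighborhood sum of Equation (3) over a magnitude spectrogram S.

    S has shape (num_frames, num_bins); the borders are zero-padded so the
    output has the same shape as the input.
    """
    C = gaussian_window_2d(M, N)
    padded = np.pad(S, ((M, M), (N, N)), mode="constant")
    out = np.zeros_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            out[i, j] = np.sum(C * padded[i:i + 2 * M + 1, j:j + 2 * N + 1])
    return out

# Hypothetical usage on a small random "spectrogram".
if __name__ == "__main__":
    S = np.abs(np.random.default_rng(0).standard_normal((20, 257)))
    print(local_energy_aggregation(S).shape)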

Figure 3. The tempo-frequency image (original and SNR=5)

3.2. Audio fingerprint generation

This paper uses LMCE to extract the perceptual features of the tempo-frequency domain. In the experiment, the number of peaks is limited to fewer than five per frame. Fig. 4 shows the feature distribution of the pure signal, and Fig. 5 shows the feature distribution of the audio signal with SNR = 5; from the figures it can be seen that noise interference generates redundant peaks. This paper therefore introduces a threshold-comparison method to find the peaks, in order to ensure their sparse and uniform distribution. The threshold is updated as in Equation (4):

Thres(i, j) = \begin{cases} Thres(i, j), & Thres(i, j) > S(k, n) \cdot W(j) \\ S(k, n) \cdot W(j), & Thres(i, j) < S(k, n) \cdot W(j) \end{cases}    (4)

where W(j) is a Gaussian window function, W(j) = e^{-(j - n)^2 / 2}.
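The following is a rough sketch, under our own assumptions, of how such a running threshold can be used to keep only sparse, prominent peaks: each frame keeps at most a few bins that exceed the current threshold, the threshold is raised to the value of the accepted peaks, and it decays slowly between frames (the decay factor is an assumption, not taken from the paper).

import numpy as np

def pick_peaks(S: np.ndarray, max_peaks: int = 5, decay: float = 0.95):
    """Thresholded peak picking over a smoothed spectrogram S (frames x bins).

    A per-bin running threshold is kept; a bin is accepted as a peak only if
    it exceeds the threshold and is one of the max_peaks strongest candidates
    in the frame.  Accepted peaks raise the threshold (Equation (4) keeps the
    larger of the old threshold and the weighted new value); the decay lets
    quieter peaks through again later.
    """
    num_frames, num_bins = S.shape
    thres = np.zeros(num_bins)
    peaks = []                                    # list of (frame, bin) feature points
    for k in range(num_frames):
        frame = S[k]
        candidates = np.where(frame > thres)[0]   # bins above the running threshold
        # keep only the strongest few candidates to enforce sparsity
        strongest = candidates[np.argsort(frame[candidates])[::-1][:max_peaks]]
        for j in strongest:
            peaks.append((k, j))
        # update: the threshold never drops below an accepted peak (cf. Equation (4))
        new_vals = np.zeros(num_bins)
        new_vals[strongest] = frame[strongest]
        thres = np.maximum(thres * decay, new_vals)
    return peaks

# Hypothetical usage on random data.
if __name__ == "__main__":
    S = np.abs(np.random.default_rng(1).standard_normal((50, 257)))
    print(len(pick_peaks(S)))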
Figure 4. Feature distribution of the pure signal

Figure 5. Feature distribution of the audio signal with SNR=5

Figure 6 shows the feature distribution that is preserved with high robustness under strong noise; over the entire spectrum range the feature points are distributed more evenly and cover the spectrum sparsely, so the extracted features have high robustness and compactness.
Table 1. Results of comparing the Shazam approach with the proposed approach

Matching approach    Number   Start frame   ID     Value of i
Shazam approach      481      11            3357   0
Proposed approach    481      48            3358   -1

Figure 6. Feature distribution with high robustness

After the peaks are extracted, the audio fingerprint is generated by combining characteristic parameters in order to improve the accuracy and the retrieval speed. Each feature point is selected as an anchor and then paired with other feature points to generate the fingerprint.
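A minimal sketch of this pairing step under our own assumptions (the number of subsequent peaks each anchor is paired with is not specified in the paper): each anchor peak is combined with a few later peaks into an (f1, delta_f, delta_t) hash, stored together with the anchor's time offset in an inverted index.

from collections import defaultdict

def build_hashes(peaks, fan_out: int = 5):
    """Pair each anchor peak with a few later peaks to form fingerprint hashes.

    peaks: list of (frame, bin) feature points.
    Returns a list of ((f1, delta_f, delta_t), anchor_frame) tuples, i.e. the
    hash of Equation (5) plus the time offset used at search time.
    """
    peaks = sorted(peaks)
    hashes = []
    for idx, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[idx + 1 : idx + 1 + fan_out]:   # small "target zone"
            hashes.append(((f1, f2 - f1, t2 - t1), t1))
    return hashes

def build_index(all_tracks):
    """Inverted index mapping hash -> list of (track_id, anchor_frame)."""
    index = defaultdict(list)
    for track_id, peaks in all_tracks.items():
        for h, t_anchor in build_hashes(peaks):
            index[h].append((track_id, t_anchor))
    return index

# Hypothetical usage with two tiny "tracks".
if __name__ == "__main__":
    tracks = {"song_a": [(0, 10), (3, 40), (7, 25)],
              "song_b": [(1, 12), (4, 44), (9, 30)]}
    print(len(build_index(tracks)))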
4. Audio search algorithm

It is possible to perform similarity searches over a large database, but two similar sound events containing similar or identical spectral peaks at the same relative timing will generate many common hashes; enough common hashes must therefore be accumulated to distinguish a true match from chance hash collisions. A statistical threshold can then be set to identify the peak of common hashes and detect a match. When the linear playback speed changes, the frequency components in the tempo-frequency domain remain essentially unchanged and only the time offset changes; the original algorithm may therefore yield a low accuracy rate, so in order to tolerate this distortion a tolerable range is set on the time offset, as given in Equation (5):

Hash = [f_1, \Delta f, \Delta t + i] = [f_1, \Delta f, \Delta t]    (5)

where f_1 is the start frequency, \Delta f is the frequency change between the two features, and \Delta t is the time change. Under different noise environments, i can be adjusted to increase the recognition rate; we take i = -1, 0, 1.
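The following sketch, under our own assumptions about the index layout (it uses the same hash and index shapes as the hypothetical pairing sketch above), matches a query against the inverted index while tolerating a small time-offset error i in {-1, 0, 1}, and scores candidates by how many hash matches agree on the same relative time offset.

from collections import Counter

def match_query(query_hashes, index, offsets=(-1, 0, 1)):
    """Match query hashes against an inverted index with a tolerant time offset.

    query_hashes: list of ((f1, delta_f, delta_t), query_anchor_time).
    index: dict mapping (f1, delta_f, delta_t) -> list of (track_id, ref_anchor_time).
    For every query hash we also probe delta_t + i for i in {-1, 0, 1}, as in
    Equation (5).  Votes are accumulated per (track_id, time_offset); a true
    match gives many votes at one consistent offset, which is what the
    statistical threshold tests.
    """
    votes = Counter()
    for (f1, df, dt), t_query in query_hashes:
        for i in offsets:
            for track_id, t_ref in index.get((f1, df, dt + i), []):
                votes[(track_id, t_ref - t_query)] += 1
    if not votes:
        return None, 0
    (track_id, _), score = votes.most_common(1)[0]
    return track_id, score

# Hypothetical usage: one query hash is off by one frame in delta_t but still matches.
if __name__ == "__main__":
    index = {(10, 30, 3): [("song_a", 40)],
             (40, -15, 4): [("song_a", 43)]}
    query = [((10, 30, 2), 5), ((40, -15, 4), 8)]
    print(match_query(query, index))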
From Table 1, in the case of distortion the proposed algorithm gives better discrimination and higher accuracy.

5. Experimental results

The simulations are made using a music database of 1000 songs covering rock, pop, classical, jazz and country music; 20 music clips with a length of 5 seconds are randomly chosen from the database as queries. To assess the robustness of the algorithm, the following distortions are applied to the queries:

Set 1: additive white noise at SNRs of 10 dB, 5 dB and 0 dB.
Set 2: recording in lab, office and outdoor environments.
Set 3: query clips of 5 s, 10 s and 15 s.

Table 2. Retrieval accuracy under different experiments

length   pure    10 dB   5 dB    0 dB
15s      99%     97%     95%     92.5%
10s      99%     95.5%   92%     89%
5s       98.6%   93%     88%     80%

From Table 2, the noise intensity and the time length of the query have different influences on the accuracy rate of the retrieval system: the longer the query clip, the better the accuracy rate.

6. Conclusion

In this letter, we introduce a robust audio fingerprinting algorithm. The audio fingerprints are produced based on a 2-D image, which is expected to be highly robust to noise and distortions. In the retrieval process, an inverted file index combined with a specially designed hashing function is adopted to reduce the retrieval time. Preliminary experimental results suggest that the proposed audio fingerprinting algorithm can work well in broadcast monitoring applications with good discrimination and recognition accuracy.

At present the proposed scheme has not yet been evaluated under a wider variety of noises and degradations. It is possible that the minimal bit error rate would rise significantly and the recognition accuracy would decline correspondingly. In the future we will focus on testing with degraded audio samples and improving the proposed algorithm accordingly.

7. Acknowledgements
This work was supported in part by Shanghai’s Key
Discipline Development Program under Grant No.
J50104. It was also supported in part by Datasentric Inc., 38660 Lexington Street, No. 440, Fremont, CA, USA.

References
[1] A. Wang, "The Shazam Music Recognition Service", Communications of the ACM, Vol. 49, No. 8, August 2006.
[2] J. T. Foote, "An Overview of Audio Information Retrieval", ACM-Springer Multimedia Systems, Vol. 7, No. 1, pp. 2-11, ACM Press/Springer-Verlag, Jan. 1999.
[3] Yuan-Yuan Shi, Xuan Zhu, Hyoung-Gook Kim, Ki-Wan Eom, "A Robust Music Retrieval System", AES 120th Convention, Paris, France, pp. 20-30, 2006.
[4] J. Haitsma and T. Kalker, "A Highly Robust Audio Fingerprinting System", Proc. 3rd ISMIR, pp. 144-148, 2002.
[5] A. Wang, "An Industrial Strength Audio Search Algorithm", Proc. 4th ISMIR, pp. 7-13, 2003.
[6] B. Zhu, W. Li, "A Novel Audio Fingerprinting Method Robust to Time Scale Modification and Pitch Shifting", Proc. 10th MM, pp. 25-29, 2010.
[7] T. Lotter, P. Vary, "Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model", EURASIP J. Appl. Signal Proc., Vol. 7, pp. 1110-1126, 2005.

