
Proposed Technique for Content-Based Sound Analysis and Ordering Using CASA and PAMIR Algorithms


T.K.Senthil Kumar S.Rajalingam
Lecturer, Department of ECE Lecturer, Department of ECE
Rajalakshmi Institute of Technology Rajalakshmi Institute of Technology
Kuthambaakam Post, Chennai – 602107 Kuthambaakam Post, Chennai – 602107
Email : tkseneee@gmail.com Email : rajalingvlsi@gmail.com
Mobile : 9941265572 Mobile: 9791499870

Abstract:
Making machines hear as humans do is one of the emerging technologies of the current technical world. If we can make machines hear as humans do, then we can use them to easily distinguish speech from music and background noises, to separate out the speech and music for special treatment, to know from which direction sounds are coming, and to learn which noises are typical and which are noteworthy. Such machines should be able to listen and react in real time, to take appropriate action on hearing noteworthy events, and to participate in ongoing activities, whether in factories, in musical performances, or in phone conversations. Existing auditory models for automatic speech recognition (ASR) have not been entirely successful, due to the highly evolved state of ASR system technologies [1], which are finely tuned to existing representations and to how the phonetic properties of speech are manifest in those representations.
One particularly promising area of machine hearing research is computational auditory scene analysis (CASA) [1]. To the extent that we can analyze sound scenes into separate meaningful components, we can achieve an advantage in tasks involving processing of those components separately. Separating speech from interference is one such application. This paper deals with the retrieval of sounds from text queries using the CASA and PAMIR algorithms with a pole-zero filter cascade peripheral model. It describes a content-based sound ranking system that uses biologically inspired auditory features and successfully learns a matching between acoustics and known text.

Keywords: PAMIR, PZFC, sparse coding
1. Introduction
Machine hearing is a field aiming to develop systems that can process, identify and classify the full set of sounds that people are exposed to. Like machine vision, machine hearing involves multiple problems: from auditory scene analysis, through "auditory object" recognition, to speech processing and recognition. While considerable effort has been devoted to speech and music related research, the wide range of sounds that people – and machines – may encounter in their everyday life has been far less studied. Such sounds cover a wide variety of objects, actions, events, and communications: from natural ambient sounds, through animal and human vocalizations, to artificial sounds that are abundant in today's environment.
Building an artificial system that processes and classifies many types of sounds poses two major challenges. First, we need to develop efficient algorithms that can learn to classify or rank a large set of different sound categories. Recent developments in machine learning, and particularly progress in large-scale methods, provide several efficient algorithms for this task. Second, and sometimes more challenging, we need to develop a representation of sounds that captures the full range of auditory features that humans use to discriminate and identify different sounds, so that machines have a chance to do so as well. To evaluate and compare auditory representations, we use a real-world task: content-based ranking of sound documents given text queries. In this application, a user enters a textual search query, and in response is presented with an ordered list of sound documents, ranked by relevance to the query. For instance, a user typing "dog" will receive an ordered set of files, where the top ones should contain sounds of barking dogs. Importantly, ordering the sound documents is based solely on acoustic content: no text annotations or other metadata are used at retrieval time. Rather, at training time, a set of annotated sound documents (sound files with textual tags) is used, allowing the system to learn to match the acoustic features of a dog bark to the text tag "dog", and similarly for a large set of potential sound-related text queries. In this way, a small labeled set can be used to enable content-based retrieval from a much larger, unlabeled set. Several previous studies have addressed the problem of content-based sound retrieval, focusing mostly on the machine-learning and information-retrieval aspects of that task, using standard acoustic representations. Here we focus on the complementary problem of finding a good representation of sounds using a given learning algorithm. The current paper proposes a representation of sounds that is based on models of the mammalian auditory system. Unlike many commonly used representations, it emphasizes fine timing relations rather than spectral analysis. We test this representation in a quantitative task: ranking sounds in response to text queries.

2. Modeling Sounds
In this paper we focus on a class of representations that is partially based on models of the auditory system and compare these representations to standard mel-frequency cepstral coefficients (MFCCs). The motivation for using auditory models follows from the observation that the auditory system is very effective at identifying many sounds, and this may be partially attributed to the acoustic features that are extracted at the early stages of auditory processing. We extract features with a four-step process, illustrated in Fig. 1: (1) a nonlinear filter bank with half-wave rectified output; (2) strobed temporal integration, which yields a stabilized auditory image (SAI); (3) sparse coding using vector quantization; (4) aggregation of all frame features to represent the full audio document. The first two steps, the filter bank and strobed temporal integration, are firmly rooted in auditory physiology and psychoacoustics. The third processing step, sparse coding, is in accordance with some properties of neural coding and has significant computational benefits that allow us to train large-scale models. The fourth step takes a "bag of features" approach which is common in machine vision and information retrieval. The remainder of this section describes these steps in detail.
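As a small illustration of step (4), the following Python sketch (our own, not from the paper) shows how per-frame codeword indices could be pooled into a single fixed-length "bag of features" vector for one audio document; the function name bag_of_features and the toy codebook size are assumptions made for illustration.

```python
import numpy as np

def bag_of_features(frame_codes, codebook_size):
    """Aggregate per-frame codeword indices into one document-level histogram.

    frame_codes: integer codeword indices, one (or more) per SAI frame.
    Returns a vector of length codebook_size counting how often each code occurs.
    """
    histogram = np.zeros(codebook_size)
    for code in frame_codes:
        histogram[code] += 1
    return histogram

# Hypothetical example: 6 frames quantized against a codebook of 8 entries.
codes = [3, 3, 5, 0, 3, 7]
print(bag_of_features(codes, codebook_size=8))  # [1. 0. 0. 3. 0. 1. 0. 1.]
```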

Figure 1: The four-step feature extraction process (see text).

2.1 Cochlear model filterbank
The first processing step is a cascade filterbank inspired by cochlear dynamics [2], known as the pole-zero filter cascade (PZFC) (Figure 2). It produces a bank of bandpass-filtered, half-wave rectified output signals that simulate the output of the inner hair cells along the length of the cochlea. The PZFC can be viewed as approximating the auditory nerve's instantaneous firing rate as a function of cochlear place, modeling both the frequency filtering and the compressive or automatic gain control characteristics of the human cochlea [2]. The PZFC also models the adaptive and frequency-dependent gain that is observed in the human cochlea, thereby acting as an automatic gain control (AGC) system.

Figure 2: The pole-zero filter cascade (PZFC).
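The sketch below is a deliberately simplified stand-in for such a cascade filterbank, written only to show the cascade-and-tap structure with half-wave rectified channel outputs. It uses generic Butterworth bandpass stages and omits the pole-zero stages and coupled AGC of the real PZFC, so all filter choices and parameter values here are assumptions for illustration.

```python
import numpy as np
from scipy import signal

def cascade_filterbank(waveform, fs, center_freqs):
    """Toy cascade filterbank: each stage is a low-order bandpass filter.

    The signal is passed through the stages in cascade and the half-wave
    rectified output of every stage is kept as one "cochlear channel".
    Real PZFC stages are pole-zero pairs with coupled automatic gain control,
    which this sketch omits.
    """
    x = waveform
    channels = []
    for fc in center_freqs:                      # ordered high to low, like the cochlea
        b, a = signal.butter(2, [0.7 * fc, 1.3 * fc], btype="bandpass", fs=fs)
        x = signal.lfilter(b, a, x)              # cascade: output feeds the next stage
        channels.append(np.maximum(x, 0.0))      # half-wave rectification at the tap
    return np.array(channels)                    # shape: (n_channels, n_samples)

# Hypothetical usage: a 1 kHz tone analysed by an 8-channel toy cascade.
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
tone = np.sin(2 * np.pi * 1000 * t)
freqs = np.geomspace(4000, 250, num=8)           # descending centre frequencies
bank = cascade_filterbank(tone, fs, freqs)
print(bank.shape)                                # (8, 1600)
```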
2.2 Strobe finding and image stabilization
The second processing step, strobed temporal integration (STI), is based on human perception of sounds, rather than purely on the physiology of the auditory system. In this step, the PZFC output is passed through a strobe-finding process, which determines the position of "important" peaks in the output of each channel. These strobe points are used to initiate temporal integration processes in each channel, adding another dimension to represent the time delay from the strobe, or trigger, points. Intuitively, this step "stabilizes" the signal, in the same way that the trigger mechanism in an oscilloscope makes a stable picture from an ongoing time-domain waveform. The end result of this processing is a series of two-dimensional frames of real-valued data (a "movie"), known as a "stabilized auditory image" (SAI) [3]. Each frame in this "movie" is indexed by cochlear channel number on the vertical axis and by lag relative to the identified strobe times on the horizontal axis. Examples of such frames are illustrated in Fig. 3 and Fig. 4.
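A minimal sketch of this idea is given below, assuming simple peak-picking as the strobe finder and plain averaging as the integrator; real SAI generation uses more refined strobe criteria and decaying temporal integration, and the function stabilized_frame and its parameters are illustrative only.

```python
import numpy as np
from scipy.signal import find_peaks

def stabilized_frame(channels, n_lags=64):
    """Toy strobed temporal integration producing one SAI-like frame.

    For every cochlear channel we locate prominent peaks ("strobes") and
    average the signal over a fixed window of lags following each strobe,
    mimicking an oscilloscope trigger.
    """
    n_channels, n_samples = channels.shape
    frame = np.zeros((n_channels, n_lags))
    for ch in range(n_channels):
        strobes, _ = find_peaks(channels[ch], height=0.5 * channels[ch].max())
        segments = [channels[ch, s:s + n_lags] for s in strobes
                    if s + n_lags <= n_samples]
        if segments:
            frame[ch] = np.mean(segments, axis=0)   # lag axis, relative to strobe
    return frame   # rows: cochlear channel, columns: lag after strobe

# Hypothetical usage on rectified toy filterbank output.
rng = np.random.default_rng(0)
toy = np.maximum(rng.standard_normal((8, 1600)), 0.0)
print(stabilized_frame(toy).shape)   # (8, 64)
```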
Figure 3: Example stabilized auditory image (SAI) frame.

Figure 4: Example stabilized auditory image (SAI) frame.

2.3 Sparse coding of an SAI
The third processing step transforms the content of the SAI [3] frames into a sparse code that captures repeating local patterns in each SAI frame. Sparse codes have become prevalent in the characterization of neural sensory systems. As such, they provide a powerful representation that can capture complex structures in data while providing computational efficiency. Specifically, sparse codes can focus on typical patterns that frequently occur in the data and use their presence to represent the data efficiently.
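The following sketch illustrates the vector-quantization idea behind such a sparse code: a feature vector derived from an SAI region is replaced by a one-hot indicator of its nearest codebook entry. The codebook here is random for demonstration; in practice it would be learned offline (for example with k-means), and all names and sizes are our assumptions.

```python
import numpy as np

def sparse_code(features, codebook):
    """Toy vector quantization: one-hot code of the nearest codebook entry.

    features: (d,) vector summarizing one SAI region.
    codebook: (k, d) matrix of codewords.
    """
    distances = np.linalg.norm(codebook - features, axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(distances)] = 1.0      # exactly one active entry: sparse
    return code

# Hypothetical example: a 48-dimensional region summary against 256 codewords.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((256, 48))
region_features = rng.standard_normal(48)
print(int(sparse_code(region_features, codebook).sum()))  # 1
```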
3. Ranking sounds given text queries
We now address the problem of ranking sound documents by their relevance to a text query. Practical uses of such a system include searching for sound files or for specific moments in the sound track of a movie. For instance, a user may be interested in finding vocalizations of monkeys to be included in a presentation about the rain forest, or in locating the specific scene in a video where breaking glass can be heard. A similar task is "musical query-by-description", in which a relation is learned between audio documents and words. We solve the ranking task in two steps. In the first step, sound documents are represented as sparse vectors. In the second step, we train a machine learning system to rank the documents using the extracted features. In this study we use the PAMIR method as the learning algorithm. PAMIR uses a fast and robust training procedure to optimize a simple linear mapping from features to query terms, given training data with known tags. The query is represented as a sparse vector of terms in the tag vocabulary (about 3,000 words), and each sound file is given a score with respect to a query via a linear product of the query vector, the learned matrix, and the feature vector. The matrix is trained to optimize a ranking criterion, such that it attempts to rank "relevant" documents higher than "nonrelevant" ones, by giving them a higher score, on the training set, for a large number of training queries that include multiword queries formed from the tag vocabulary.
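The sketch below illustrates the kind of passive-aggressive update PAMIR performs on the scoring matrix, using a hinge-style ranking loss and the aggressiveness parameter C mentioned in the experiments; it is a compact reconstruction of the idea under our own assumptions, not the authors' implementation, and the toy dimensions are ours.

```python
import numpy as np

def pamir_update(W, q, d_pos, d_neg, C=0.1):
    """One passive-aggressive step of a PAMIR-style linear ranker.

    The score of document d for query q is q @ W @ d. If a relevant document
    d_pos is not ranked above an irrelevant one d_neg by a margin of 1, W is
    moved just enough to fix it, with the step size capped by C.
    """
    loss = max(0.0, 1.0 - q @ W @ d_pos + q @ W @ d_neg)
    if loss > 0.0:
        grad = np.outer(q, d_pos - d_neg)             # direction that raises the margin
        tau = min(C, loss / (np.linalg.norm(grad) ** 2 + 1e-12))
        W = W + tau * grad
    return W

# Hypothetical toy run: 5 query terms, 12 acoustic feature dimensions.
rng = np.random.default_rng(2)
W = np.zeros((5, 12))
q = np.zeros(5); q[1] = 1.0                           # sparse query: one tag active
d_pos, d_neg = rng.random(12), rng.random(12)
for _ in range(10):
    W = pamir_update(W, q, d_pos, d_neg, C=0.1)
print(q @ W @ d_pos > q @ W @ d_neg)                  # True after a few updates
```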
4. Experiments
We planned to evaluate the auditory representation in a quantitative ranking task using a large set of audio recordings that cover a wide variety of sounds. We compare sound retrieval based on the SAI with retrieval based on standard MFCC features. In what follows we describe the dataset and the experimental setup.

4.1 The dataset
We planned to collect a few thousand sound effects from multiple sources: commercially available sound effect collections, the BBC sound effects library, and a variety of web sites, including www.findsounds.com, acoustica.com, ilovewavs.com, simplythebest.net, wav-sounds.com, wavsource.com, and wavlist.com. We plan to manually label all of the sound effects by listening to them and typing in a handful of tags for each sound. This process was used to add tags to the existing tags (from www.findsounds.com) and to tag the unlabeled files from other sources. When labeling, the original file name was displayed, so the labeling decision was influenced by the description given by the original author of the sound effect. We intend to restrict our tags to a somewhat limited set of terms. We also added high-level tags to each file. For instance, files with tags such as 'rain', 'thunder' and 'wind' were also given the tags 'ambient' and 'nature'. Files tagged 'cat', 'dog', and 'monkey' were augmented with the tags 'mammal' and 'animal'. These higher-level terms assist in retrieval by inducing structure over the label space. All terms are stemmed, using the Porter stemmer for English. After stemming, we planned to have around 3,000 tags.
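A small sketch of this tag-augmentation and stemming step is shown below, assuming a hand-built map from specific tags to high-level tags and NLTK's Porter stemmer; the map HIGH_LEVEL and the helper expand_and_stem are hypothetical names of ours.

```python
from nltk.stem.porter import PorterStemmer

# Hypothetical parent-tag map used to add high-level tags to each file.
HIGH_LEVEL = {"rain": ["ambient", "nature"], "thunder": ["ambient", "nature"],
              "wind": ["ambient", "nature"], "cat": ["mammal", "animal"],
              "dog": ["mammal", "animal"], "monkey": ["mammal", "animal"]}

def expand_and_stem(tags):
    """Add high-level tags and stem all terms with the Porter stemmer."""
    stemmer = PorterStemmer()
    expanded = set(tags)
    for tag in tags:
        expanded.update(HIGH_LEVEL.get(tag, []))
    return sorted(stemmer.stem(t) for t in expanded)

print(expand_and_stem(["dog", "barking"]))   # e.g. ['anim', 'bark', 'dog', 'mammal']
```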
4.2 The experimental setup
We planned to use standard cross validation [5] to estimate the performance of the learned ranker. Specifically, we plan to split the set of audio documents into three equal parts, using two thirds for training and the remaining third for testing. Training and testing are repeated for all three splits of the data, so that we obtain an estimate of the performance on all the documents. We will remove from the training and the test set any queries that have fewer than k = 5 documents in either the training set or the test set, and remove the corresponding documents if they contain no other tag. We will use a second level of cross validation to determine the values of the hyperparameters: the aggressiveness parameter C and the number of training iterations. In general, performance was good as long as C was not too high, and lower C values required longer training. We selected a value of C = 0.1, which was also found to work well in other applications, and 10M iterations. Our study indicates that the system is not very sensitive to the value of these parameters. To evaluate the quality of the ranking obtained by the learned model, we can use the precision (fraction of positives) within the top k audio documents from the test set as ranked for each query.
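Precision within the top k documents can be computed as in the short sketch below; this is our illustration, and the function name precision_at_k and the toy data are assumptions.

```python
import numpy as np

def precision_at_k(scores, relevant, k=5):
    """Precision within the top-k documents returned for one query.

    scores: per-document relevance scores assigned by the ranker.
    relevant: boolean array marking which documents are truly relevant.
    """
    top_k = np.argsort(scores)[::-1][:k]          # indices of the k highest scores
    return np.mean(np.asarray(relevant)[top_k])

# Hypothetical example: 3 of the top 5 documents are relevant -> precision 0.6.
scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.05])
relevant = np.array([True, False, False, True, False, True, True])
print(precision_at_k(scores, relevant, k=5))      # 0.6
```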
4.3 SAI and sparse coding parameters
The process of transforming SAI frames into sparse codes [6] has several parameters which can be varied. We plan to define a default parameter set and then perform experiments in which one or a few parameters are varied from this default set. The default parameters cut the SAI into rectangles starting with the smallest size of 16 lags by 32 channels, leading to a total of 49 rectangles. All the rectangles are reduced to 48 marginal values each, and for each box a codebook of size 256 is used, for a total of 49 × 256 = 12,544 feature dimensions. Using this default experiment as a baseline for comparisons, we can make systematic variations to several parameters and study their effect on the retrieval precision. First, we modify two parameters that determine the shape of the PZFC filter: Pdamp and Zdamp [6]. Then, we modify the smallest rectangle size used for sparse segmentation and limit the maximum number of rectangles used for the sparse segmentation. Further variants use systematic variation of the codebook sizes used in sparse coding.
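The sketch below illustrates, under our own assumptions about box placement, how one such rectangle can be cut from an SAI frame and reduced to its 48 marginal values (32 channel sums plus 16 lag sums); the multi-scale doubling that yields the full set of 49 rectangles is not shown.

```python
import numpy as np

def rectangle_marginals(sai_frame, top, left, n_channels=32, n_lags=16):
    """Cut one rectangle out of an SAI frame and reduce it to its marginals.

    A 32-channel by 16-lag box yields 32 row sums plus 16 column sums,
    i.e. the 48 marginal values per rectangle mentioned in the text.
    """
    box = sai_frame[top:top + n_channels, left:left + n_lags]
    return np.concatenate([box.sum(axis=1), box.sum(axis=0)])   # 32 + 16 = 48 values

# Hypothetical example on a random 64-channel x 128-lag SAI frame.
rng = np.random.default_rng(3)
frame = rng.random((64, 128))
print(rectangle_marginals(frame, top=0, left=0).shape)          # (48,)
```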
5. Conclusion
We described a content-based sound ranking system that uses biologically inspired auditory features and successfully learns a matching between acoustics and known text labels. We used PAMIR to systematically study many alternative sparse-feature representations ("front ends"). Our analysis supports the hypothesis that a front end mimicking several aspects of the human auditory system provides an effective representation for machine hearing. These aspects include a realistic nonlinear adaptive filter bank and a stage that exploits the temporal fine structure at the filter bank output (modeling the cochlear nerve) via the concept of the stabilized auditory image. Importantly, however, the auditory model described in this paper may not always be optimal, and future work on characterizing the optimal parameters and architecture of auditory models is expected to further improve the precision, depending on the task at hand. One approach to feature construction would have been to manually construct features that are expected to discriminate well between specific classes of sounds. For instance, periodicity could be a good discriminator between wind in the trees and a howling wolf. However, as the number of classes grows, such careful design of discriminative features may become infeasible. Here we take the opposite approach: assuming that perceptual differences rely on lower-level cochlear feature extraction, we propose models inspired by cochlear processing to obtain a very high dimensional representation, and let the learning algorithm identify the features that are most discriminative. Since our system currently uses only features from short windows, we envision future work to incorporate more of the dynamics of the sound over longer times, either as a bag of patterns using patterns that represent more temporal context, or through other methods.

References:
[1] R. F. Lyon, A. G. Katsiamis, and E. M. Drakakis, "History and future of auditory filter models," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), 2010, pp. 3809-3812.

[2] R. F. Lyon, "Filter cascades as analogs of the cochlea," in Neuromorphic Systems Engineering: Neural Networks in Silicon, T. S. Lande, Ed. Norwell, MA: Kluwer, 1998, pp. 3-18.

[3] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, "Complex sounds and auditory images," in Proc. 9th Int. Symp. Hearing, Auditory Physiology and Perception, Y. Cazals, L. Demany, and K. Horner, Eds. Oxford: Pergamon, 1992, pp. 429-446.

[4] M. Slaney and R. F. Lyon, "On the importance of time: a temporal representation of sound," in Visual Representations of Speech Signals, M. Cooke, S. Beet, and M. Crawford, Eds. New York: Wiley, 1993, pp. 95-116.

[5] D. Grangier and S. Bengio, "A neural network to retrieve images from text queries," in Proc. Artificial Neural Networks (ICANN 2006), 2006, pp. 24-34.

[6] M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik, "Sound ranking using auditory sparse-code representations," in ICML Workshop on Sparse Methods for Music Audio, 2009.
