Proposed Technique for Content Based Sound Analysis and Ordering Using CASA and PAMIR Algorithms
T. K. Senthil Kumar, Lecturer, Department of ECE, Rajalakshmi Institute of Technology, Kuthambaakam Post, Chennai - 602107. Email: tkseneee@gmail.com, Mobile: 9941265572
S. Rajalingam, Lecturer, Department of ECE, Rajalakshmi Institute of Technology, Kuthambaakam Post, Chennai - 602107. Email: rajalingvlsi@gmail.com, Mobile: 9791499870
Abstract: Making machines hear as humans do is one of the emerging technologies of the current technical world. If we can make machines hear as humans do, we can use them to easily distinguish speech from music and background noises, to separate out speech and music for special treatment, to determine the direction from which sounds are coming, and to learn which noises are typical and which are noteworthy. Such machines should be able to listen and react in real time: to take appropriate action on hearing noteworthy events and to participate in ongoing activities, whether in factories, in musical performances, or in phone conversations. Existing auditory models for automatic speech recognition (ASR) have not been entirely successful, due to the highly evolved state of ASR system technologies [1], which are finely tuned to existing representations and to how the phonetic properties of speech are manifest in those representations. One particularly promising area of machine hearing research is computational auditory scene analysis (CASA) [1]. To the extent that we can analyze sound scenes into separate components, we gain an advantage in tasks that process those components separately; separating speech from interference is one such application. This paper deals with the retrieval of sounds from text queries using the CASA and PAMIR algorithms with a pole-zero filter cascade peripheral model. It describes a content-based sound ranking system that uses biologically inspired auditory features and successfully learns a matching between acoustics and known text.

Keywords: PAMIR, PZFC, sparse coding
1. Introduction
Machine hearing is a field aiming to develop systems that can process, identify, and classify the full set of sounds that people are exposed to. Like machine vision, machine hearing involves multiple problems: from auditory scene analysis, through "auditory object" recognition, to speech processing and recognition. While considerable effort has been devoted to speech- and music-related research, the wide range of sounds that people (and machines) may encounter in everyday life has been far less studied. Such sounds cover a wide variety of objects, actions, events, and meaningful communications: from natural ambient sounds, through animal and human vocalizations, to the artificial sounds that are abundant in today's environment.

Building an artificial system that processes and classifies many types of sounds poses two major challenges. First, we need efficient algorithms that can learn to classify or rank a large set of different sound categories; recent developments in machine learning, and particularly progress in large-scale methods, provide several efficient algorithms for this task. Second, and sometimes more challenging, we need a representation of sounds that captures the full range of auditory features that humans use to discriminate and identify different sounds, so that machines have a chance to do so as well.

To evaluate and compare auditory representations, we use a real-world task: content-based ranking of sound documents given text queries. In this application, a user enters a textual search query and in response is presented with an ordered list of sound documents, ranked by relevance to the query. For instance, a user typing "dog" will receive an ordered set of files, where the top ones should contain sounds of barking dogs. Importantly, ordering the sound documents is based solely on acoustic content: no text annotations or other metadata are used at retrieval time. Rather, at training time, a set of annotated sound documents (sound files with textual tags) is used, allowing the system to learn to match the acoustic features of a dog bark to the text tag "dog", and similarly for a large set of potential sound-related text queries. In this way, a small labeled set can be used to enable content-based retrieval from a much larger, unlabeled set. Several previous studies have addressed content-based sound retrieval, focusing mostly on the machine-learning and information-retrieval aspects of the task using standard acoustic representations. Here we focus on the complementary problem of finding a good representation of sounds for a given learning algorithm. This paper proposes a representation of sounds based on models of the mammalian auditory system; unlike many commonly used representations, it emphasizes fine timing relations rather than spectral analysis. We test this representation in a quantitative task: ranking sounds in response to text queries.
2. Modeling Sounds
In this paper we focus on a class of representations that is partially based on models of the auditory system, and we compare these representations to standard mel-frequency cepstral coefficients (MFCCs). The motivation for using auditory models follows from the observation that the auditory system is very effective at identifying many sounds, and this may be partially attributed to the acoustic features extracted at the early stages of auditory processing. We extract features with a four-step process, illustrated in Fig. 1: (1) a nonlinear filter bank with half-wave rectified output; (2) strobed temporal integration, which yields a stabilized auditory image (SAI); (3) sparse coding using vector quantization; (4) aggregation of all frame features to represent the full audio document. The first two steps, the filter bank and strobed temporal integration, are firmly rooted in auditory physiology and psychoacoustics. The third step, sparse coding, is in accordance with some properties of neural coding and has significant computational benefits that allow us to train large-scale models. The fourth step takes a "bag of features" approach that is common in machine vision and information retrieval. The remainder of this section describes these steps in detail.

[Figure 1: The four-step feature extraction process.]
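To make the data flow concrete, here is a minimal runnable sketch of the four-step chain in Python. Every function is a deliberately simplified stand-in of our own devising (the real PZFC and strobed-integration stages are described in the following subsections); only the overall flow and the final bag-of-features aggregation mirror the process described above.

import numpy as np

def half_wave_rectified_filterbank(x, n_channels=8):
    # Stand-in for step 1: crude band-limiting by moving averages of
    # increasing length, followed by half-wave rectification per channel.
    return np.array([np.maximum(np.convolve(x, np.ones(k) / k, mode="same"), 0.0)
                     for k in range(1, n_channels + 1)])

def frame_into_images(channels, frame_len=64, hop=32):
    # Stand-in for step 2: cut the channel outputs into overlapping frames,
    # giving a sequence of two-dimensional (channel x lag) "images".
    starts = range(0, channels.shape[1] - frame_len + 1, hop)
    return [channels[:, s:s + frame_len] for s in starts]

def vector_quantize(image, codebook):
    # Step 3: sparse coding by vector quantization; each frame becomes a
    # one-hot code marking its nearest codeword.
    dists = np.linalg.norm(codebook - image.ravel(), axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(dists)] = 1.0
    return code

# Step 4: "bag of features" -- sum the sparse codes over all frames so one
# fixed-length vector represents the whole audio document.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)              # one second of toy audio
codebook = rng.standard_normal((256, 8 * 64))   # 256 codewords
frames = frame_into_images(half_wave_rectified_filterbank(audio))
doc_vector = np.sum([vector_quantize(f, codebook) for f in frames], axis=0)
print(doc_vector.shape)                         # (256,)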
2.1 Cochlear model filterbank
The first processing step is a cascade filterbank inspired by cochlear dynamics [2], known as the pole-zero filter cascade (PZFC) (Figure 2). It produces a bank of bandpass-filtered, half-wave rectified output signals that simulate the output of the inner hair cells along the length of the cochlea. The PZFC can be viewed as approximating the auditory nerve's instantaneous firing rate as a function of cochlear place, modeling both the frequency filtering and the compressive, or automatic gain control, characteristics of the human cochlea [2]. The PZFC also models the adaptive, frequency-dependent gain observed in the human cochlea, thereby acting as an automatic gain control (AGC) system.

[Figure 2: Structure of the pole-zero filter cascade (PZFC).]
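As a rough illustration only, the toy filterbank below mimics the cascade architecture: a chain of simple two-pole resonators, with each stage's output fed onward to the next stage and also tapped, half-wave rectified, as a channel output, and with a crude static gain normalization standing in for the PZFC's adaptive AGC loops. The actual PZFC stage design (pole and zero placement, coupled AGC feedback) follows [2]; nothing below should be read as that design.

import numpy as np
from scipy.signal import lfilter

def toy_cascade_filterbank(x, fs, n_stages=32):
    # Toy cascade in the spirit of the PZFC: each stage is a two-pole
    # resonator; its output feeds the next stage, and a half-wave
    # rectified tap of it forms one output channel.
    outputs, y = [], x
    # Center frequencies run from high to low, as along the cochlea.
    for fc in np.geomspace(0.4 * fs / 2, 100, n_stages):
        theta, r = 2 * np.pi * fc / fs, 0.95
        a = [1.0, -2 * r * np.cos(theta), r ** 2]          # resonant poles near fc
        y = lfilter([1 - r], a, y)                         # cascade onward
        tap = np.maximum(y, 0.0)                           # half-wave rectify
        gain = 1.0 / (np.sqrt(np.mean(tap ** 2)) + 1e-9)   # static stand-in for AGC
        outputs.append(gain * tap)
    return np.array(outputs)   # (n_stages, len(x)): inner-hair-cell-like output

fs = 16000
t = np.arange(fs) / fs
nap = toy_cascade_filterbank(np.sin(2 * np.pi * 440 * t), fs)
print(nap.shape)               # (32, 16000)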
2.2 Strobe finding and image stabilization
The second processing step, strobed temporal integration (STI), is based on human perception of sounds rather than purely on the physiology of the auditory system. In this step, the PZFC output is passed through a strobe-finding process, which determines the position of "important" peaks in the output of each channel. These strobe points are used to initiate temporal integration processes in each channel, adding another dimension to represent the time delay from the strobe, or trigger, points. Intuitively, this step "stabilizes" the signal, in the same way that the trigger mechanism in an oscilloscope makes a stable picture from an ongoing time-domain waveform. The end result of this processing is a series of two-dimensional frames of real-valued data (a "movie"), known as a "stabilized auditory image" (SAI) [3]. Each frame in this "movie" is indexed by cochlear channel number on the vertical axis and by lag relative to the identified strobe times on the horizontal axis. Examples of such frames are illustrated in Fig. 3 and Fig. 4.

[Figure 3 and Figure 4: Example frames of the stabilized auditory image.]
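A minimal sketch of strobed temporal integration, assuming a deliberately simple strobe rule (local maxima above half the channel maximum) in place of the more careful strobe criteria of real SAI implementations; it averages lag-aligned segments, exactly as an oscilloscope trigger would:

import numpy as np

def strobed_temporal_integration(nap, n_lags=200):
    # One SAI frame: channel number on one axis, lag from strobe on the other.
    n_channels, n_samples = nap.shape
    sai = np.zeros((n_channels, n_lags))
    for ch in range(n_channels):
        x = nap[ch]
        thresh = 0.5 * x.max() + 1e-12
        # Strobe points: prominent local maxima in this channel.
        strobes = [i for i in range(1, n_samples - n_lags)
                   if x[i] >= thresh and x[i] > x[i - 1] and x[i] >= x[i + 1]]
        for s in strobes:
            sai[ch] += x[s:s + n_lags]   # integrate the segment after each strobe
        if strobes:
            sai[ch] /= len(strobes)      # average, like a triggered scope trace
    return sai

rng = np.random.default_rng(1)
sai = strobed_temporal_integration(np.abs(rng.standard_normal((32, 4000))))
print(sai.shape)                         # (32, 200)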
2.3 Sparse coding of an SAI
The third processing step transforms the content of the SAI frames [3] into a sparse code that captures repeating local patterns in each stabilized auditory image. Sparse codes have become prevalent in the characterization of neural sensory systems. A sparse code provides a powerful representation that can capture complex structures in data while remaining computationally efficient. Specifically, sparse codes can focus on typical patterns that frequently occur in the data and use their presence to represent the data efficiently.
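The box-cutting and vector quantization steps can be sketched as follows. For brevity this version uses a single rectangle size rather than the multi-scale set of 49 rectangles used later in the experiments, but the reduction of each rectangle to its row and column marginals (32 + 16 = 48 values) and the one-hot code per box follow the description in Section 4.3:

import numpy as np

def sai_to_sparse_code(sai, codebooks, box_h=32, box_w=16):
    # Cut the SAI into (box_h x box_w) rectangles, reduce each to its
    # 48 marginals, vector-quantize against that box's own codebook, and
    # concatenate the one-hot codes into one long sparse vector.
    codes, box = [], 0
    n_channels, n_lags = sai.shape
    for top in range(0, n_channels - box_h + 1, box_h):
        for left in range(0, n_lags - box_w + 1, box_w):
            rect = sai[top:top + box_h, left:left + box_w]
            marginals = np.concatenate([rect.sum(axis=1), rect.sum(axis=0)])
            dists = np.linalg.norm(codebooks[box] - marginals, axis=1)
            one_hot = np.zeros(codebooks[box].shape[0])
            one_hot[np.argmin(dists)] = 1.0   # a single active codeword per box
            codes.append(one_hot)
            box += 1
    return np.concatenate(codes)

rng = np.random.default_rng(2)
sai = np.abs(rng.standard_normal((32, 64)))          # one toy SAI frame
codebooks = [rng.standard_normal((256, 48)) for _ in range(4)]
print(sai_to_sparse_code(sai, codebooks).shape)      # (1024,) = 4 boxes x 256 codewords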
3. Ranking sounds given text queries
We now address the problem of ranking sound documents by their relevance to a text query. Practical uses of such a system include searching for sound files or for specific moments in the soundtrack of a movie. For instance, a user may want to find vocalizations of monkeys to include in a presentation about the rain forest, or to locate the specific scene in a video where breaking glass can be heard. A similar task is "musical query-by-description", in which a relation is learned between audio documents and words. We solve the ranking task in two steps. In the first step, sound documents are represented as sparse vectors. In the second step, we train a machine learning system to rank the documents using the extracted features. In this study we use the PAMIR method as the learning algorithm. PAMIR uses a fast and robust training procedure to optimize a simple linear mapping from features to query terms, given training data with known tags. The query is represented as a sparse vector of terms in the tag vocabulary (about 3,000 words), and each sound file is given a score with respect to a query via a linear matrix product: features times matrix times query. The matrix is trained to optimize a ranking criterion, such that it attempts to rank "relevant" documents higher than "nonrelevant" ones, by giving them a higher score, on the training set, for a large number of training queries that include multiword queries formed from the tag vocabulary.
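The scoring rule (query vector times learned matrix times feature vector) and the passive-aggressive character of the PAMIR update can be sketched as below. Variable names are ours, and real PAMIR training draws triplets over thousands of queries for millions of iterations rather than looping over a small fixed list:

import numpy as np

def pamir_train(triplets, n_terms, n_feats, C=0.1, epochs=10):
    # Learn W so that score(q, f) = q @ W @ f ranks relevant documents
    # above irrelevant ones. For each triplet (query, relevant features,
    # irrelevant features), update W just enough to fix a margin violation,
    # with the step size capped by the aggressiveness parameter C.
    W = np.zeros((n_terms, n_feats))
    for _ in range(epochs):
        for q, f_pos, f_neg in triplets:
            violation = 1.0 - q @ W @ f_pos + q @ W @ f_neg
            if violation > 0:
                diff = f_pos - f_neg
                tau = min(C, violation / ((q @ q) * (diff @ diff) + 1e-12))
                W += tau * np.outer(q, diff)   # passive-aggressive step
    return W

def rank_documents(W, query_vec, doc_features):
    scores = doc_features @ W.T @ query_vec    # q @ W @ f for every document
    return np.argsort(-scores)                 # best-scoring documents first

# Toy usage: a 3-term tag vocabulary, 5-dimensional document features.
rng = np.random.default_rng(3)
q = np.array([1.0, 0.0, 0.0])                  # sparse query: term 0 only
docs = rng.standard_normal((4, 5))
W = pamir_train([(q, docs[0], docs[2])], n_terms=3, n_feats=5)
print(rank_documents(W, q, docs))              # document 0 now outscores document 2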
4. Experiments
We evaluate the auditory representation in a quantitative ranking task using a large set of audio recordings that cover a wide variety of sounds, and we compare sound retrieval based on the SAI with retrieval based on standard MFCC features. In what follows we describe the dataset and the experimental setup.

4.1 The dataset
We planned to collect a few thousand sound effects from multiple sources: commercially available sound effect collections, the BBC sound effects library, and a variety of websites (www.findsounds.com, acoustica.com, ilovewavs.com, simplythebest.net, wav-sounds.com, wavsource.com, and wavlist.com). We planned to manually label all of the sound effects by listening to them and typing in a handful of tags for each sound; this was used both to add tags to existing tags (from www.findsounds.com) and to tag the unlabeled files from the other sources. When labeling, the original file name was displayed, so the labeling decision was influenced by the description given by the original author of the sound effect. We restricted our tags to a somewhat limited set of terms. We also added high-level tags to each file. For instance, files with tags such as 'rain', 'thunder', and 'wind' were also given the tags 'ambient' and 'nature', and files tagged 'cat', 'dog', and 'monkey' were augmented with the tags 'mammal' and 'animal'. These higher-level terms assist retrieval by inducing structure over the label space. All terms are stemmed using the Porter stemmer for English; after stemming, we planned to have around 3,000 tags.

4.2 The experimental setup
We planned to use standard cross validation [5] to estimate the performance of the learned ranker. Specifically, we split the set of audio documents into three equal parts, using two thirds for training and the remaining third for testing. Training and testing were repeated for all three splits of the data, so that we obtained an estimate of the performance on all the documents. We removed from the training and test sets any query that had fewer than k = 5 documents in either the training set or the test set, and removed the corresponding documents if they contained no other tag. We used a second level of cross validation to determine the values of the hyperparameters: the aggressiveness parameter C and the number of training iterations. In general, performance was good as long as C was not too high, and lower C values required longer training. We selected a value of C = 0.1, which was also found to work well in other applications, and 10M training iterations; in our study the system was not very sensitive to the values of these parameters. To evaluate the quality of the ranking obtained by the learned model, we use the precision (fraction of positives) within the top k audio documents from the test set as ranked for each query.
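The evaluation measure is simple to state in code; a minimal sketch of precision within the top k, computed per query and then averaged over queries and over the three cross-validation splits:

def precision_at_k(ranked_doc_ids, relevant_ids, k=5):
    # Fraction of the top-k ranked documents that carry the query's tag.
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant_ids) / k

# Toy usage: documents 2, 5, and 7 are relevant to the query.
print(precision_at_k([5, 1, 2, 9, 7, 3], {2, 5, 7}, k=5))   # 0.6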
4.3 SAI and sparse coding parameters
The process of transforming SAI frames into sparse codes [6] has several parameters that can be varied. We planned to define a default parameter set and then perform experiments in which one or a few parameters were varied from this default. The default parameters cut the SAI into rectangles starting with a smallest size of 16 lags by 32 channels, leading to a total of 49 rectangles. Each rectangle was reduced to 48 marginal values, and for each box a codebook of size 256 was used, for a total of 49 × 256 = 12544 feature dimensions. Using this default experiment as a baseline for comparisons, we can make systematic variations to several parameters and study their effect on retrieval precision. First, we vary two parameters that determine the shape of the PZFC filter: Pdamp and Zdamp [6]. Then, we vary the smallest rectangle size used for sparse segmentation, and we limit the maximum number of rectangles used for the sparse segmentation. Further variants systematically vary the codebook sizes used in sparse coding.

5. Conclusion
We described a content-based sound ranking system that uses biologically inspired auditory features and successfully learns a matching between acoustics and known text labels. We used PAMIR to study systematically many alternative sparse-feature representations ("front ends"). Our analysis supports the hypothesis that a front end mimicking several aspects of the human auditory system provides an effective representation for machine hearing. These aspects include a realistic nonlinear adaptive filter bank and a stage that exploits temporal fine structure at the filter bank output (modeling the cochlear nerve) via the concept of the stabilized auditory image. Importantly, however, the auditory model described in this paper may not always be optimal, and future work on characterizing the optimal parameters and architecture of auditory models is expected to further improve precision, depending on the task at hand. One approach to feature construction would have been to manually construct features expected to discriminate well between specific classes of sounds; for instance, periodicity could be a good discriminator between wind in the trees and a howling wolf. However, as the number of classes grows, such careful design of discriminative features becomes infeasible. Here we take the opposite approach: assuming that perceptual differences rely on low-level cochlear feature extraction, we propose models inspired by cochlear processing to obtain a very high-dimensional representation, and we let the learning algorithm identify the features that are most discriminative. Since our system currently uses only features from short windows, we envision future work to incorporate more of the dynamics of sound over longer times, either as a bag of patterns using patterns that represent more temporal context, or through other methods.

References:
[1] R. F. Lyon, A. G. Katsiamis, and E. M. Drakakis, "History and future of auditory filter models," in Proc. IEEE Int. Symp. Circuits and Systems, 2010, pp. 3809-3812.
[2] R. F. Lyon, "Filter cascades as analogs of the cochlea," in Neuromorphic Systems Engineering: Neural Networks in Silicon, T. S. Lande, Ed. Norwell, MA: Kluwer, 1998, pp. 3-18.
[3] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, "Complex sounds and auditory images," in Proc. 9th Int. Symp. Hearing, Auditory Physiology and Perception, Y. Cazals, L. Demany, and K. Horner, Eds. Oxford: Pergamon, 1992, pp. 429-446.
[4] M. Slaney and R. F. Lyon, "On the importance of time: A temporal representation of sound," in Visual Representations of Speech Signals, M. Cooke, S. Beet, and M. Crawford, Eds. New York: Wiley, 1993, pp. 95-116.
[5] D. Grangier and S. Bengio, "A neural network to retrieve images from text queries," in Proc. Artificial Neural Networks - ICANN 2006, 2006, pp. 24-34.
[6] M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik, "Sound ranking using auditory sparse-code representations," in ICML Workshop on Sparse Methods for Music Audio, 2009.