
Audio Engineering Society

Convention Paper
Presented at the 130th Convention, 2011 May 13-16, London, UK
The papers at this Convention have been selected on the basis of a submitted abstract and extended précis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

A comprehensive and modular framework for audio content extraction, aimed at research, pedagogy and digital library management
Olivier Lartillot

Finnish Centre of Excellence in Interdisciplinary Music Research, University of Jyväskylä, 40014, Finland

Correspondence should be addressed to Olivier Lartillot (olartillot@gmail.com)

ABSTRACT

We present a framework for audio analysis and the extraction of low-level features, mid-level structures and high-level concepts, altogether studied as a fully interwoven complex system. Composite operations are constructed via an intuitive programming language on top of Matlab. Datasets of any size can be processed thanks to implicit memory management mechanisms. The data structure enables a tight articulation between signal and symbolic layers in a unified framework. The resulting technology can be used as a pedagogical tool for the understanding of audio, speech and musical processes and concepts, and for the content-based discovery of digital libraries. Other applications include intelligent browsing and structuring of digital libraries, information retrieval, and the design of content-based audio interfaces.

This paper introduces an open-source project for signal processing that includes a new, simple and adaptive syntactic layer on top of the Matlab environment for the control of high-level operators. The MiningSuite is a complete redesign of our previous framework MIRtoolbox [1], initially focused on the domain of Music Information Retrieval, that can be applied to extra-musical domains as well. This software tool is the product of an ongoing research project, presented in section 1, related to the extraction of three layers of content from audio and music. The comprehensive scope of the investigation requires a highly modular methodological framework, which is discussed in section 2. The main characteristics of the framework are described in section 3, and examples of operators and their applications are given in section 4. Particular memory management capabilities are discussed in section 5


and concrete application areas are evoked in section 6. The framework is offered to the scientific community as an open-source project, as explained in section 7.

1. A COMPREHENSIVE FRAMEWORK

We present a general framework for the analysis of audio recordings aimed at the extraction of content, applied in particular to music analysis, which we propose to locate along three layers of representation. On the lowest layer, a large range of descriptions of sounds (mainly timbral, but also simple representations of rhythm and tonality, for instance) are based on standard signal processing operations (spectral analysis, envelope extraction, etc.), as shown in the features part of Table 1. On the middle layer, the signal is structurally organised through the emergence of symbolic events (notes, speech utterances, etc.) as well as more elaborate structures, such as groups of events, phrases, structural parts, etc. In the current version of our released framework, this corresponds to event detection (events and notes in the structures part of Table 1) and to the use of a novelty curve (novelty) based on post-processing of a similarity matrix (simatrix). Significant research is being carried out for the development and integration of new innovative methods of structural analysis.1 Finally, high-level concepts are inferred, such as musical genres, speech styles, or other semantic categories, as well as emotional classes. The current version of the framework includes emotion [3, 4].

1 The MiningSuite will also include PatMinr, a package dedicated to the automated detection of pattern repetitions in symbolic sequences [2], which will be tightly integrated with the other packages presented in this paper.

The large scope of this framework (and of the underlying research study) is, in our view, generally beneficial for the problem of content extraction, and even necessary. Indeed, offering a general overview of techniques for content extraction makes it possible to build a comprehensive state of the art, and fosters the improvement and combination of approaches. Besides, we support the opinion that a detailed understanding of music, in particular, requires carefully taking into consideration the complex interdependencies between the various representational dimensions, such as pitch, rhythm or structure. The proposed framework enables a complete modelling of such complex interdependencies. Whereas the extraction of low-level features can often rely on purely signal processing methods, the determination of middle-level structures and high-level concepts requires an interdisciplinary collaboration between psychoacoustics, cognitive science (in particular, auditory scene analysis, in order to make explicit the rules of the emergence of notes, complex polyphonies and structures), artificial intelligence, social science and neuroscience.2

2 For instance, a current collaboration with experts in intercultural music studies (in particular, Mondher Ayari, University of Strasbourg) and in music perception and cognition (in particular, Stephen McAdams, McGill University, and Petri Toiviainen, University of Jyväskylä) has among its objectives to reveal the complex interactions between the perception of structural aspects of music and the cultural background. This collaborative project, called Creativity, Music, Culture, is funded by the French research agency ANR for the years 2011-2013.

2. A MODULAR ARCHITECTURE

In the proposed framework, analytical processes are conceived not as a succession of low-level commands, but as a flowchart composed of high-level customisable operators. Corresponding to distinct and clearly defined representations of signal, sound and music, these building blocks can be combined in many ways and offer a variety of options. As shown in Table 1, operators are organized into packages corresponding to separate domains of study: signal processing (SigMinr), audio analysis and auditory modeling (AudiMinr) and music analysis (MusiMinr).


              SigMinr                  AudiMinr                 MusiMinr
  features    input                    input                    input
              spectrum                 spectrum                 spectrum
              filterbank               filterbank               filterbank
              zerocross                brightness               pitch
              flux                     roughness                tempo
              stat                     mfcc                     key, mode
              cepstrum                 pitch
              envelope                 envelope
              peaks
  structures  events                   events                   notes
              simatrix                 attack                   beats
              novelty                                           tempo
                                                                key, mode
  concepts                                                      emotion

Table 1: Overview of operators in The MiningSuite.

2.1. A Coherent Integration of Expertise
Certain operators offer separate expertise in each particular domain, which can be combined:

input simply loads the data from an input file, and offers filtering tools to select particular temporal regions, particular channels, etc. SigMinr accepts various audio formats such as WAV or MP3, whereas MusiMinr can load symbolic music representations such as MIDI files.

spectrum displays the distribution of energy along frequencies. In the SigMinr module, it corresponds to the FFT operation but also to the autocorrelation function (cf. section 4 for more technical details); the AudiMinr module integrates perceptual modeling such as Terhardt's outer-ear filtering [5], resonance curves that emphasize frequencies that are more easily perceived [6], auditory scales such as Mel and Bark bands, and masking effects in critical bands; the MusiMinr module considers the decomposition of the energy into cents and along musical scales, corresponding to a chromagram representation.

filterbank decomposes the temporal signal into frequency bands. The AudiMinr module offers particular decompositions that correspond to auditory modelling, such as Gammatone filters [7]. The MusiMinr module adds particular filterbank decompositions that optimize musically oriented operations, such as note pitch extraction.

envelope extracts the amplitude envelope of the signal, with the help of a large range of options such as Hilbert transform, down- and upsampling, various low-pass filters, spectrogram, wave rectification, logarithm, differentiation, etc. AudiMinr offers some auditory models: for instance, the decomposition of the input signal into a bank of Gammatone filters, followed by envelope extraction on each band and summation back; or particular auditory mechanisms such as the mu-law [8]. The envelope is used in AudiMinr for event detection and in MusiMinr for beat and tempo estimation and note detection.

emotion evaluates the emotional content of recordings along various emotional dimensions and classes. We built a model based on musical recordings [3], but it seems that a part of the emotional content stems from extra-musical audio characteristics (timbral, energetic, etc.), so a version for the AudiMinr module is under investigation.

2.2. Advantages of Modularity
The modular conception of signal processing offers particular advantages:

The design of new analytical processes can be conceived directly as a succession of high-level operators, where minor low-level technical considerations can be hidden, thus saving a significant amount of effort; as many MIRtoolbox users have reported, they are exempted from developing redundant code themselves.

This fusion of pluridisciplinary scientific expertise yields rich operators with a large collection of options.

This schematism stimulates a modular vision of analytical processes and the development of high-level operators. For instance, the integration of spectral flux required the design of a particular flux operator, which turned out to be fruitful in the conception of advanced musical analytical processes.
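As a foretaste of the syntax detailed in section 3, the flux operator just mentioned can be sketched as a two-operator chain. This mirrors the frame-based example given in section 3.2.1; 'myfile' is a placeholder file name, and the quoted keywords follow standard Matlab string conventions:

a = sig.spectrum('myfile','Frame')   % frame-decomposed spectrum
b = sig.flux(a)                      % change between successive frames

The spectrum is computed for each successive frame, and flux then measures the frame-to-frame change.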


3. MAIN DESIGN PRINCIPLES

The operators are available via an innovative language built as an additional layer on top of the Matlab language. The maximal simplicity of its syntax helps users concentrate on the chain of operations leading to the desired analytical process. Low-level technical considerations can be modified via options associated with those operators, or simply ignored. The output data encapsulates all required technical details, such as the sampling rate.

3.1. An Intuitive and Flexible Language
In The MiningSuite syntax, any command consists of the name of an operator, followed, as arguments, by the name of the input, a file name with or without its extension, for instance:

sig.input('filename.wav')

and by keywords related to options, when desired:

sig.input('filename','Sampling',44100)

An operator can be applied automatically to all valid files in a given directory by simply writing 'Folder' as the first argument, or even to subdirectories recursively, keeping track of the complete directory structure, by using the 'Folders' keyword. The output can be stored in a variable, which can be sent as input to another operator, and so on:

a = sig.input('myfile')
b = sig.spectrum(a)
c = sig.cepstrum(b)

Instead of specifying a list of operations, it is possible to call only the final operator corresponding to the desired output:

sig.cepstrum('myfile')

In order to perform multiple post-processing operations, a single operator can be called several times successively:

a = sig.spectrum('myfile')
b = sig.spectrum(a,'Max',5000)
c = sig.spectrum(b,'Log')

or only once, with all the options enumerated as successive keywords (in any order):

sig.spectrum('myfile','Max',5000,'Log')

By threading successive operations, complex flowcharts returning multiple outputs can be designed easily, enabling a factorization of operations and the suppression of redundant computation:

a = sig.spectrum('myfile')
b = audi.brightness(a)
c = audi.mfcc(a)

3.2. Data Decomposition
Various methods for decomposing and recombining signals are unified into a single framework.

3.2.1. Frame Decomposition
Frame decomposition is performed by simply adding the 'Frame' keyword as an additional argument of the operator where the decomposition needs to be performed, for instance when calling a single operator:

musi.tempo('myfile','Frame')

In this case, the decomposition follows the default frame configuration associated with tempo estimation. These parameters can be changed as well; for instance, for a frame size of 3 seconds and a hop of 1 second:

musi.tempo('myfile','Frame',3,'s',1,'s')

The decomposition process is implicitly integrated into the operators at the most suitable places. For instance, analyzing the temporal evolution of tempo requires a frame decomposition after the envelope extraction.


Any operator called with an input argument that has previously been frame-decomposed is automatically applied to each frame separately. Hence the previous operation is roughly equivalent to the following script:

a = audi.envelope('myfile','Frame',3,'s',1,'s')
b = musi.tempo(a)

It is also possible to combine several layers of frame decomposition, where large frames (for instance, 5 seconds long) are further decomposed into small frames (say, via a .1-second-long, half-overlapped spectrogram):

a = sig.input('myfile','Frame',5,'s')
b = sig.spectrum(a,'Frame',.1,'s',.05,'s')
c = sig.flux(b)
d = sig.stat(c)

In that example, the statistics (stat) are computed within each large frame separately. Equivalently, the large frames can result from a recombination of small frames:

a = sig.spectrum('myfile','Frame')
b = sig.flux(a)
c = sig.stat(b,'Frame',5,'s')

3.2.2. Channel Decomposition
Any operator called with an input argument that has previously been decomposed into separate channels using filterbank is automatically applied to each channel separately:

a = audi.filterbank('myfile')
b = audi.envelope(a)
c = sig.sum(b)

In the following example, each channel contains independent notes, such that any further operator will be applied to each note of each channel separately. For instance, the pitch content can be estimated for each separate note, and the result can be saved in a multi-channel MIDI file:

a = audi.filterbank('myfile')
b = musi.note(a)
c = musi.pitch(b)
musi.save(c,'output.mid')

3.3. Multi-Layer Representation

3.3.1. Symbolic Inference
The MiningSuite includes a set of analytical tools that highlight and select particular points in the input signal:

peaks features a large range of peak-picking options. These have usually been developed for particular domains (periodicity analysis, envelope extraction, spectral analysis, etc.), but their integration into a single interdisciplinary module enables, once again, a productive sharing of knowledge between research communities. In particular, peaks offers the possibility of tracking peaks along successive frames (cf. section 4 for more technical details).

events highlights particular temporal regions corresponding to candidate events, based on various strategies (silence, local discontinuities, novelty, etc.). A complex event can be decomposed into a series of sub-events. Whereas events is a general operator integrated in AudiMinr, notes is a specialization in MusiMinr for the detection and characterization of elementary musical events. A note can be decomposed into sub-events, corresponding for instance to the attack, sustain and release phases.

3.3.2. Symbolic Layers
The complex network of information produced by these analytical operators forms symbolic layer(s) superposed on top of the original audio layer. These symbols (events, notes, etc.) point to particular temporal regions of the signal and are described by a set of parameters. In the musical context, these layers can be organized, for instance, in the following way:


A note layer containing note events. A MIDI file, for instance, once loaded using musi.input, is translated into a symbolic layer containing notes, without any audio layer underneath. The symbolic layer can be either a simple succession of events, or a far more complex graph where events enter into relations of temporal succession and/or superposition, and where branches represent channels of variable durations. In the note layer, this corresponds to the musical concept of polyphony, in its most general sense.

A metric layer containing a hierarchy of beats, i.e., pulsations.

A structure layer containing a complex configuration (not necessarily hierarchical) of structures encompassing notes, structures of structures, etc.

3.3.3. Articulation between Signal and Symbolic Layers
A large range of signal processing methods included in the framework are generalized to this heterogeneous and multi-modal data representation. In MIRtoolbox, audio could be segmented into a sequence of successive non-overlapping parts, onto which further analyses could be carried out automatically. In The MiningSuite, this is generalized by allowing not only segments but any construction of symbolic events related to particular temporal regions of the input signal. For instance, a musical recording can be analyzed by focusing particularly on the parts of the signal related to actual notes, and discarding transient events.

This data structure also allows tight interconnections between several input signals, such as a purely symbolic representation of a piece of music (a score) and one or several recordings of the same piece. The interconnection can consist of an alignment of the note or metric layers of each input signal. This will enable, for instance, the study of temporal fluctuation across various musical performances, or more general comparisons, as well as advanced analyses of the audio input guided by the referential score.
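To make the notion of a symbolic layer more concrete, here is a minimal Matlab sketch of what such a layer could look like. This is purely illustrative and does not reproduce the actual MiningSuite data structure; all field names are assumptions:

% Hypothetical note layer: each event points back to a temporal region
% of the signal and carries a set of descriptive parameters.
note1 = struct('onset',0.52,'offset',0.98,'pitch',440.0,'channel',1);
note2 = struct('onset',1.02,'offset',1.47,'pitch',493.9,'channel',1);
% The cell braces keep the struct array as a single field value.
layer = struct('type','note','events',{[note1, note2]});
% For a loaded MIDI file, such a layer would exist without any audio
% layer underneath for the onset/offset fields to index into.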

3.4. Observation of the Results
Any result of the analysis process (the final output or any intermediary step) can be immediately visualized through a graphical display of its content within a dedicated figure window. A large range of outputs can also be directly sonified using a play method. This makes it possible to hear, in particular, the successive frames of a moving window, the separate channels of a filterbank decomposition, the successive segments of a segmentation, the shape of an envelope (modulating white noise), as well as pitch, beats, chromagram, etc. A database can also be quickly browsed in the form of audio snippets played in ascending order of a specified extracted feature. Statistical and data-mining post-processing operations can also be applied directly, using dedicated operators or through exportation to other software.

3.5. Quality Checking
Each operator in The MiningSuite keeps a record of the list of operations subsequently performed on the input. In this way, it is possible, for each result that has been stored, to trace back the complete list of operations for verification and quality-checking purposes. As part of the development quality requirements, each operator in The MiningSuite includes a range of tests checking whether the operator is used coherently and reasonably connected to the whole flowchart. Tests also check that the input data fulfills particular requirements ensuring the good quality of the analysis. Warning and error messages are displayed if the quality of the result cannot be assured. For instance, tonal analysis in music requires a sufficiently high frequency resolution of the spectral decomposition over a particular range of frequencies. If the input data does not meet this constraint, compromises are automatically made and a specific warning message is displayed.

4. EXAMPLES OF OPERATORS

4.1. Examples of Signal Processing Operators
The computation of the autocorrelation function (one of the methods integrated in the spectrum operator) can benefit from improvements developed in various areas. Side-border distortion can be suppressed by dividing the autocorrelation by the autocorrelation of its window (preferably Hanning) [9].


A magnitude compression of the amplitude decreases the width of the peaks in the autocorrelation curve, which is suitable, for instance, for multi-pitch extraction [10]. The subharmonics implicitly included in the autocorrelation function can be tentatively suppressed from the wave-rectified output by subtracting time-scaled versions of the output. The standard normalization of the autocorrelation function (which forces a value of 1.0 at zero lag) requires, for a multi-channel input, taking into account the relative amount of energy along the channels.

Peak picking (the peaks operator) can use an adaptive threshold: a given local maximum is considered a peak if its distance from the previous and successive local minima (if any) is higher than a specified threshold, expressed with respect to the total amplitude of the input signal. This method has proved to offer quite reliable results in many applications, and has been used extensively in MIRtoolbox. It is also possible, for instance, to automatically extract the lobe of each peak and to compute statistical moments (centroid, spread, etc.) for each of these peaks separately. In frame-decomposed data, peaks can be tracked along time, by connecting successive peaks that are sufficiently aligned (cf. Figure 1). This method, initially developed for speech analysis [11], was first used in MIRtoolbox for the tracking of spectral harmonics. We recently improved the method by allowing gaps between successive peaks, and generalized it to the creation of a whole graph of connections between peaks.

Fig. 1: Peak picking and tracking.

A signal can be automatically segmented into a series of homogeneous sections (considered as one possible kind of event) through the estimation of discontinuities in the temporal evolution of particular features. This is estimated as follows: first, feature-based distances between all possible frame pairs are stored in a similarity matrix (simatrix, cf. Figure 2). Convolution along the main diagonal of the matrix using a Gaussian checkerboard kernel yields a novelty curve that indicates the temporal locations of significant textural changes [12]. Peak detection then returns the temporal positions of the feature discontinuities, which can be used for the actual segmentation of the audio sequence.

Fig. 2: Similarity matrix and corresponding novelty curve.
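The following Matlab sketch schematically re-implements the novelty computation described above [12]. It is not the MiningSuite code itself; the similarity matrix S is a random placeholder, and the kernel size is arbitrary:

S = rand(200); S = (S + S')/2;          % placeholder similarity matrix
L = 32;                                 % kernel half-width, in frames
[u,v] = meshgrid(-L:L,-L:L);
g = exp(-(u.^2 + v.^2)/(2*(L/2)^2));    % Gaussian taper
K = g .* sign(u) .* sign(v);            % checkerboard sign pattern
N = size(S,1);
novelty = zeros(N,1);
for i = L+1:N-L                         % slide along the main diagonal
    B = S(i-L:i+L, i-L:i+L);            % window centred on frame i
    novelty(i) = sum(sum(B .* K));      % correlation with the kernel
end
% Peaks of the novelty curve give candidate segmentation boundaries.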

4.2. Examples of Auditory Modeling Applications
The timbral characterization of sounds includes the description of the attack of events (for instance through the computation of the attack slope in the envelope curve, cf. Figure 3), of sound brightness [13], the computation of Mel-Frequency Cepstral Coefficients (mfcc), the characterization of roughness based on beating effects between components close in frequency [14], etc.

Fig. 3: Envelope curve with events and their related attack phases.

4.3. Examples of Music Analysis Applications
Concerning rhythmic analysis, tempo is estimated through the estimation of periodicities in the amplitude envelope curve (this envelope curve can be computed both from an audio signal and from a note representation [6]).


Particular frequency regions that are more easily perceived are emphasized by applying a cognitively-based resonance curve. The same analysis can be performed directly on symbolic data by summing Gaussian kernels located at the onset point of each note, the height of each kernel being proportional to the duration of the respective note. In both cases, diverse descriptions of the resulting autocorrelation curve (global minimum or maximum, kurtosis of the main peak, etc.) lead to an assessment of pulse clarity [15].

Concerning tonal analysis in the audio domain, the spectrum is converted from the frequency domain to the pitch domain by applying a log-frequency transformation. This distribution of energy along pitches, called a chromagram and included in musi.spectrum, is then wrapped, showing the distribution of energy with respect to the twelve possible pitch classes (C, C#, D, etc.). The tonality is assessed by comparing, through cross-correlation, the chromagram to a theoretical pitch distribution associated with each possible tonality [16]. The most prevalent tonality is considered to be the key candidate. A richer representation of the tonality estimation can be drawn with the help of a self-organizing map (SOM) trained with the 24 tonal profiles. The key is estimated by projecting the wrapped chromagram onto the SOM [17] (cf. Figure 4).

Fig. 4: Self-organizing map projection of the chromagram, related to key structures.
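As a schematic illustration of the cross-correlation step described above (this is not the actual musi code; the chromagram and the 24 key profiles are random placeholders, the real profiles being those discussed in [16]):

chroma = rand(12,1);                     % wrapped chromagram (C, C#, ..., B)
profiles = rand(12,24);                  % one theoretical profile per key
scores = zeros(1,24);
for k = 1:24
    c = corrcoef(chroma, profiles(:,k)); % correlation with key profile k
    scores(k) = c(1,2);
end
[~, best] = max(scores);                 % most prevalent tonality: key candidate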

5. MEMORY MANAGEMENT

Datasets of any size can be processed thanks to implicit memory management mechanisms.

5.1. Dataflow Design and Evaluation
The input of an operator, of a series of operations, or of a more complex flowchart does not need to be connected to one particular input (a file, a set of files, etc.), but can be stored as an abstract flowchart design, which can subsequently be applied to any input data. The abstract flowchart is progressively constructed using the same minimalist syntax presented in section 3.1. The only difference is that, instead of specifying a specific input file, the keyword 'Design' is used. For instance:

a = sig.spectrum('Design','Frame')
b = sig.struct
b.brightness = audi.brightness(a)
b.mfcc = audi.mfcc(a)

The flowchart (here, b) can then be evaluated on a particular file or folder(s) of files:

sig.eval(b,'Folders')

One particular interest (and importance) of this approach is that it allows the application of such flowcharts to large sets of audio files, or to long single audio files. The decomposition of the input into chunks of reasonable size, as well as the recombination of the results for these successive chunks, are automatically taken care of by the underlying code. Thanks to these implicit memory management processes, datasets of any size can be processed automatically.

5.2. Architecture Details
The adaptiveness of the operator syntax, the acceptance of various input types, as well as the capability to analyze big databases and long audio files without memory overflow, impose a clarification and unification of the operator code structure, which is divided into three main phases:


An initialization phase connects the given operator to the general dataflow: it unrolls the series of preliminary operators implicitly called by the operator before the specific set of operations strictly related to the operator itself. For instance, tempo('myfile') unrolls the list of preliminary operations necessary for tempo extraction: envelope, spectrum, etc.

The core operations perform the essential aspects related to the operator, which generally reduce the amount of data available from the input, and output the data in the expected new format.

The post-processing operations apply a certain number of operations to the output data.

This distinction is all the more important when considering the analysis of long audio files, which requires the decomposition of the input data into successive chunks. The initialization enables the complete dataflow to be drawn, which is used as a basis for specifying the characteristics of the chunk decomposition, given the underlying memory requirements and the available resources. Each successive chunk is passed to the core operations, where significant data reductions usually take place. The output can then be recombined chunk after chunk, through either concatenation, summation, averaging, etc., depending on the operator, forming a single output signal in the end. The post-processing operations can then be performed directly on the whole data.

The proposed memory management avoids the use of temporary files, which nevertheless remains available for the most demanding cases. For example, the possible use of zero-phase filtering in envelope extraction requires two scans of the input data, in both directions. Advanced techniques have been added to the framework, such as the correction of rounding errors that could occur if resampling were performed in each chunk separately.
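The chunk-wise evaluation scheme can be pictured with the following plain-Matlab sketch. The chunk size and the RMS reduction are arbitrary illustrations of the core and recombination phases, not MiningSuite internals:

chunk = 2^20;                            % chunk length in samples (arbitrary)
info = audioinfo('myfile.wav');          % query the length without loading
total = 0; n = 0;
for k = 1:chunk:info.TotalSamples
    last = min(k+chunk-1, info.TotalSamples);
    y = audioread('myfile.wav', [k last]);  % core phase: one chunk at a time
    total = total + sum(y(:,1).^2);         % per-chunk data reduction
    n = n + (last-k+1);
end
result = sqrt(total/n);                  % recombination into a single output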

6. APPLICATION AREAS

Three main application areas are considered.

6.1. Perceptual and Cognitive Modeling
First, from a purely scientific point of view, the resulting framework offers a detailed explanation of the perceptual and cognitive mechanisms underlying the perception of content from audio.

6.2. Pedagogical Tool
Secondly, the resulting tool has a large range of pedagogical potential (as we already experienced with the previous version of our platform, MIRtoolbox, with thousands of users from the academic world), offering a very intuitive language for experimenting with signal processing and for the extraction and visualisation of content of all kinds. This technology offers not only the possibility of understanding the nature of those types of content and the way they are extracted, but also rich perspectives for exploring digital libraries along selected dimensions of representation, such as musical genres, emotions, musical concepts, etc.

6.3. Digital Library Management
A third application area concerns the use of the proposed framework for intelligent browsing and structuring of digital libraries, information retrieval, and the design of content-based audio interfaces.

7. THE MININGSUITE

The proposed framework, called The MiningSuite, is offered to the signal processing community as an open-source project, with a source repository, discussion lists for users, developers and commits, wiki pages, etc.6

The Matlab environment offers advanced capabilities: currently, analyses can be performed in parallel on different processor cores and on clusters of computers, and GPUs can be used as well. In future work, we plan real-time versions, platform-specific compiled versions, web services, etc.
6 The project website can be accessed at the following address http://code.google.com/p/miningsuite
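As an illustration of the parallel capabilities mentioned above, a batch analysis could be distributed over processor cores with Matlab's Parallel Computing Toolbox. The operator call follows the syntax of section 3, but the loop itself is our own sketch rather than documented MiningSuite usage:

files = dir('*.wav');                    % assumed folder of audio files
bright = cell(numel(files),1);
parfor i = 1:numel(files)
    bright{i} = audi.brightness(files(i).name);  % one file per worker
end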


The MiningSuite being an open-source project, the source code of all these operators is freely available. A set of development quality requirements encourages in particular considering the building block represented by each operator as an actual white box, where the complete code is clearly structured and sufficiently commented, in order to foster close collaboration between developers and users via the open-source community network. This ensures better control of code errors and faster development of new features. A Software Development Kit offers users the possibility of developing their own operators: meta-functions hide all the aforementioned complex mechanisms, so that operators can be designed and coded using very simple templates.

8. REFERENCES

[1] O. Lartillot and P. Toiviainen, A Matlab toolbox for musical feature extraction from audio, presented at the 10th International Conference on Digital Audio Effects, Bordeaux, France, 2007 September 10-15.

[2] O. Lartillot, Multi-dimensional motivic pattern extraction founded on adaptive redundancy filtering, Journal of New Music Research, 34 (2005), no. 4, 375-393.

[3] T. Eerola, O. Lartillot and P. Toiviainen, Prediction of multidimensional emotional ratings in music from audio using multivariate regression models, presented at the 10th International Conference on Music Information Retrieval, Kobe, Japan, 2009 October 26-30.

[4] P. Saari, T. Eerola and O. Lartillot, Generalizability and simplicity as criteria in feature selection: Application to mood classification in music, IEEE Transactions on Audio, Speech, and Language Processing, in press, TASL.2010.2101596.

[5] E. Terhardt, Calculating virtual pitch, Hearing Research, 1 (1979), 155-182.

[6] P. Toiviainen and J. Snyder, Tapping to Bach: Resonance-based modeling of pulse, Music Perception, 21 (2003), no. 1, 43-80.

[7] R. D. Patterson et al., Complex sounds and auditory images, in Auditory Physiology and Perception, edited by Y. Cazals et al., Oxford, 1992, 429-446.

[8] A. Klapuri, A. Eronen and J. Astola, Analysis of the meter of acoustic musical signals, IEEE Transactions on Audio, Speech and Language Processing, 14 (2006), no. 1, 342-355.

[9] P. Boersma, Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound, Institute of Phonetic Sciences Proceedings, 17 (1993), 97-110.

[10] T. Tolonen and M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing, 8 (2000), no. 6, 708-716.

[11] R. McAulay and T. Quatieri, Speech analysis/synthesis based on a sinusoidal representation, IEEE Transactions on Acoustics, Speech and Signal Processing, 34 (1986), no. 4, 744-754.

[12] J. Foote and M. Cooper, Media segmentation using self-similarity decomposition, Storage and Retrieval for Multimedia Databases, SPIE Proceedings 5021 (2003), 167-175.

[13] P. N. Juslin, Cue utilization in communication of emotion in music performance: Relating performance to perception, Journal of Experimental Psychology: Human Perception and Performance, 26 (2000), no. 6, 1797-1813.

[14] W. A. Sethares, Tuning, Timbre, Spectrum, Scale, Springer-Verlag, 1998.

[15] O. Lartillot, T. Eerola, P. Toiviainen and J. Fornari, Multi-feature modeling of pulse clarity: Design, validation, and optimization, presented at the 9th International Conference on Music Information Retrieval, Philadelphia, USA, 2008 September 14-18.

[16] E. Gomez, Tonal description of music audio signals, PhD thesis, Universitat Pompeu Fabra, Barcelona, 2006.

[17] P. Toiviainen and C. Krumhansl, Measuring and modeling real-time responses to music: The dynamics of tonality induction, Perception, 32 (2003), no. 6, 741-766.
