
NIT KURUKSHETRA

SEMINAR REPORT
MUSIC EMOTION RECOGNITION

Submitted to: Mr. Satender Jaglan

Submitted by: Rajesh (11610410, EC-3)
MUSIC EMOTION RECOGNITION USING SUPPORT VECTOR CLASSIFICATION
Rajesh (11610410, EC3)

Abstract: The described method for Music Emotion Recognition attempts to deduce the inherent
emotion that a piece of music is trying to convey. This knowledge can in turn be applied in many
emotion-based applications.
With music being released in such large volumes, there is a need to classify it in the most basic
way, i.e. by the feeling or emotion it instils in the listener. MER is an interdisciplinary study that
spans not just signal processing and machine learning, but also reaches into psychology: what
emotions actually are, how they are perceived, whether the current parameters used to define
emotion are universal, or whether everyone has a different perception of what those feelings are.

The challenge is that there is no well-developed emotion model for describing musical emotion.
Various methods have been tried to detect the inherent emotion of music; however, the accuracy is
not yet satisfactory. In this method, a two-level classification system is implemented, based on the
music genre and the extracted features, thereby utilising the most suitable acoustic information.
For verification, the consistency between the music features and the ground truth has been measured.

Keywords:
Arousal, Valence, Support Vector Machines, RReliefF.

Introduction:
Imagine coming home all worked up and a calm melody starts playing, or getting home after a
huge victory to some rocking music. This is what an application of MER is capable of: detecting a
person's emotion and playing the apt music accordingly.
However, unlike with video, deducing emotion from audio is quite difficult, since very few features
are strongly linked to arousing emotions in an individual. Also, since everyone's perception differs,
the same piece of music can be perceived as arousing different emotions in different individuals.
This subjectivity makes MER all the more difficult. It is not easy to describe and label emotions in a
universal way, since the adjectives used for defining them are nebulous.
In order to extract emotions from music, a number of acoustic characteristics like timbre, intensity
and rhythm are carefully analysed. The ambiguity has been somewhat resolved by the use of the
arousal and valence parameters. The main aim of MER techniques is to place the song as a
coordinate on Thayer's Arousal-Valence plane, a.k.a. the AV plane.

Thayer’s AV Plane:

The plane plots the different emotional states in terms of positivity (valence) and energy (arousal).
The previously defined acoustic features are used to analyse the level of positivity and energy in a
song, using several tools such as support vector classification, RReliefF, and the ground-truth annotations.

Previous Methods:
1. AV Modelling:
This technique was used to detect emotion variations in a video sequence. A number of
features were observed and their weighted sum was used to predict the values of arousal
and valence. The features used to compute arousal included motion vectors between video
frames, changes in shot length and the energy of the sound, while the pitch of the sound
was used for valence.

2. FUZZY Approach:
Emotions are divided into four quadrants. Four membership values, one for each of the
four quadrants, are assigned to the input music sample, each signifying the relative
strength of that particular emotion class (a small numeric example is sketched after this
section's summary):
a = u1 + u2 - u3 - u4
v = u1 + u4 - u2 - u3
The fuzzy approach uses the geometric relationship between arousal and valence, which is
inexact, so applying arithmetic operations for AV computation might not be the best approach.
3. System Identification Approach:
Here, the input music sample is split into one-second segments; for each segment the
ground truth is established (for data validation) and 18 features are extracted. This tries
to generalise the emotional content, but it relies on the temporal relationship between
the music segments, which might not always be consistent.

Of the above methods, the system identification approach has the highest accuracy, with 78.4% for
arousal and 21.9% for valence.
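
As a tiny worked example of the fuzzy mapping above (the membership values u1..u4 are arbitrary
illustrative numbers, not taken from the report):

u1, u2, u3, u4 = 0.6, 0.2, 0.1, 0.1        # quadrant membership strengths (illustrative)
a = u1 + u2 - u3 - u4                      # arousal  = 0.6
v = u1 + u4 - u2 - u3                      # valence  = 0.4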
Discussed Method:
1. Regression Approach:
Regression theory is employed to process the data, observe the variables (the features),
and then predict a real value for the test cases. Its performance is rather easy to analyse
and it is also easy to optimise.

Following factors are taken into account:

1. Domain of R:
Coordinates in Thayer's AV plane.
2. Ground Truth:
It encompasses the individuals' responses.
3. Feature Extraction:
Different features, such as timbre and bass, are extracted from the music sample.
4. Regression Algorithm:
Different regression algorithms, such as k-nearest neighbours and SVM, should be tried and the
most accurate one adopted.
5. Number of Regressors:
Two regressors, RA and RV, are required for the two parameters, i.e. arousal and valence.
6. Training Fashion:
It should be observed whether the regressors give better predictions when trained independently
or when they are treated as interdependent (see the sketch after this list).
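
A minimal sketch of this regression set-up, assuming scikit-learn's SVR as the regression algorithm
and synthetic placeholder data in place of the real 35-dimensional features and AV annotations:

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.rand(200, 35)                    # placeholder 35-dimensional feature vectors
y_arousal = np.random.uniform(-1, 1, 200)      # placeholder arousal annotations in [-1, 1]
y_valence = np.random.uniform(-1, 1, 200)      # placeholder valence annotations in [-1, 1]

# Two regressors, RA and RV, trained independently (one per emotion dimension).
RA = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
RV = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
RA.fit(X, y_arousal)
RV.fit(X, y_valence)

# A test clip is then placed as a (valence, arousal) coordinate on Thayer's AV plane.
x_test = np.random.rand(1, 35)
print("AV coordinate:", (RV.predict(x_test)[0], RA.predict(x_test)[0]))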

Tools Used:
1. Uniform Format:
Songs are broken into segments of 25 seconds each; they are also converted to a uniform
format of 22050 Hz, 16-bit, mono-channel PCM WAV and normalized to the same volume
level. For every segment, the emotion is inferred, and the emotion of the song is then
taken as the average of the emotions conveyed by its segments (a pre-processing sketch
follows).
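
The report does not name the conversion tool; as one possible realisation, the librosa and
soundfile libraries can resample to 22050 Hz mono, normalise the volume and write 25-second,
16-bit PCM WAV segments:

import numpy as np
import librosa
import soundfile as sf

SR = 22050
SEGMENT_SEC = 25

def preprocess(path, out_prefix):
    y, _ = librosa.load(path, sr=SR, mono=True)    # resample and downmix to mono
    y = y / (np.max(np.abs(y)) + 1e-9)             # normalise to a common volume level
    hop = SR * SEGMENT_SEC
    for i in range(0, len(y) - hop + 1, hop):      # any trailing partial segment is dropped
        sf.write(f"{out_prefix}_{i // hop}.wav", y[i:i + hop], SR, subtype="PCM_16")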
2. Music Information Retrieval:
Low-level stimuli like tempo, pitch, loudness and timbre are effective in estimating arousal,
while mode and harmony are useful for deducing valence. However, more complex parameters
such as inharmonicity, roughness and pulse clarity should also be taken into account. In all,
four main categories, Rhythm, Timbre, Tonality and Dynamics, encompass the 35 extracted
features.

For this, the spectral contrast algorithm, the DWCH algorithm, PsySound and Marsyas are used.
PsySound is found to extract the largest number of features that are strongly related to
arousing emotions. (An illustrative extraction sketch follows.)
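
The report relies on PsySound, Marsyas and the spectral-contrast and DWCH algorithms; purely as
an illustration, a few of the listed descriptors from the four categories can be sketched with
librosa (this covers only part of the 35-feature set):

import numpy as np
import librosa

def extract_features(y, sr=22050):
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)                    # rhythm: tempo (BPM)
    zcr = librosa.feature.zero_crossing_rate(y).mean()                # timbre: zero-crossing rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)   # timbre: first 13 MFCCs
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()     # timbre: spectral roll-off
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)     # tonality: 12 chroma bins
    rms = librosa.feature.rms(y=y).mean()                             # dynamics: RMS energy
    return np.hstack([tempo, zcr, mfcc, rolloff, chroma, rms])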

3. Subjective Test:
About 250 randomly chosen volunteers were asked to listen to 10 random music samples from
the database and to rate them in terms of AV values ranging from -1.0 to 1.0.
They were asked to:
a. Label the emotion that the song tried to convey, not the one that they themselves felt.
b. Express their general response to the overall song, i.e. including the lyrics, the
melody and the singing.
c. Take as much time as they wanted.

The quality of this step is key to the performance analysis.

4. RReliefF:
The extracted features are not necessarily all equally important. Some of them might convey
redundant information and lead to inaccurate conclusions. A Feature Selection Algorithm (FSA)
is therefore used to find, from the feature space, the subset of good features that gives the
maximum prediction accuracy. For instance, energy-related features are not of much relevance
because the sample sound volume has been normalized. (A compact sketch of RReliefF is given below.)
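
A compact NumPy sketch of the RReliefF weight update for the regression case (following
Robnik-Šikonja and Kononenko; equal weighting of the k nearest neighbours is assumed here
instead of the paper's rank-based weighting):

import numpy as np

def rrelieff(X, y, m=100, k=10, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    span_x = X.max(axis=0) - X.min(axis=0) + 1e-12     # per-feature normaliser for diff()
    span_y = y.max() - y.min() + 1e-12                 # target normaliser for diff()
    N_dC, N_dA, N_dCdA = 0.0, np.zeros(p), np.zeros(p)
    for _ in range(m):
        i = rng.integers(n)                            # randomly selected instance
        dist = np.abs((X - X[i]) / span_x).sum(axis=1) # L1 distance in feature space
        dist[i] = np.inf
        for j in np.argsort(dist)[:k]:                 # k nearest neighbours
            d = 1.0 / k                                # equal neighbour weight
            diff_y = abs(y[i] - y[j]) / span_y
            diff_A = np.abs(X[i] - X[j]) / span_x
            N_dC += diff_y * d
            N_dA += diff_A * d
            N_dCdA += diff_y * diff_A * d
    return N_dCdA / N_dC - (N_dA - N_dCdA) / (m - N_dC)

# Features with the highest weights are kept, e.g. the top 15 of the 35:
# top = np.argsort(rrelieff(X, y_arousal))[::-1][:15]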
Performance Evaluation:
Different feature spaces, ground truths and regression algorithms are experimented with and then
compared in terms of the R2 statistic, a standard way of measuring the fitness of a regression
model. An R2 value of 1 means the model fits perfectly, while a negative value means the model is
worse than simply predicting the average value.

Ten-fold cross-validation is used to check the performance of the regression model: the dataset is
divided into 10 sets, of which 9 are used for training and 1 for testing, with the assignment chosen
at random. This process is repeated 20 times, and for each repetition the R2 values for arousal and
valence are calculated separately (a sketch of this protocol follows).
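
A sketch of this protocol with scikit-learn (X, y_arousal and y_valence are the feature matrix and
annotations, e.g. as in the earlier regression sketch; SVR stands in for whichever regressor is
being evaluated):

from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVR

cv = RepeatedKFold(n_splits=10, n_repeats=20, random_state=0)   # 10 folds, repeated 20 times
r2_arousal = cross_val_score(SVR(kernel="rbf"), X, y_arousal, scoring="r2", cv=cv)
r2_valence = cross_val_score(SVR(kernel="rbf"), X, y_valence, scoring="r2", cv=cv)
print("mean R2 (arousal):", r2_arousal.mean())
print("mean R2 (valence):", r2_valence.mean())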
Ground Truth Validation:
The ground truth is cross-checked in two ways:
A. The annotations given by different volunteers for the same song are compared with the averaged
annotation. The higher the standard deviation, the lower the confidence in the ground truth.

B. It is also checked whether the annotations given by the same subject for the same song remain
consistent: the subject is asked to annotate the song again about two months after the first test.
If the absolute difference between the two annotations falls below 0.1, confidence in the ground
truth is strengthened.
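
Both checks are simple statistics; a small sketch, assuming annotations is a dict mapping song id
to a list of (arousal, valence) ratings, and first/retest hold one subject's two annotation rounds
for the same songs:

import numpy as np

def annotation_std(annotations):
    # Check A: spread of the volunteers' ratings around the averaged annotation, per song.
    return {song: np.std(np.asarray(r), axis=0) for song, r in annotations.items()}

def retest_consistent(first, retest, tol=0.1):
    # Check B: same subject, same songs, about two months apart; absolute differences
    # below 0.1 strengthen confidence in the ground truth.
    return np.all(np.abs(np.asarray(first) - np.asarray(retest)) < tol)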

SUPPORT VECTOR MACHINE

It is a supervised machine-learning algorithm that is used mainly as a classification tool. Each
data point is plotted as a coordinate in an n-dimensional space, n being the number of features.
The main aim is to find a HYPERPLANE that demarcates the two classes in the best possible way.
Methodology:
1. Finding the apt hyper-plane:
a. Scene 1:

We have to find a hyper-plane that best segregates the classes. In the accompanying figure,
hyper-plane B is the most suitable one.
b. Scene 2:
Now that we have the orientation of the hyper-plane, we have to select the one with the
maximum margin. The margin is defined as the distance of the hyper-plane from the nearest
data points of the two classes.

Here, C has the highest margin. The higher the margin, the more robust the model, and hence the
lower the chance of misclassification.
c. Scene 3:

In cases where the data points cannot be completely segregated, the most isolated point (here, a
star) is treated as an outlier. The algorithm is designed to be robust to such outliers, focussing
instead on the region with the highest concentration of data points.
d. For non-linear classification:
For a feature space in which the classes cannot be separated in two dimensions, another feature
(i.e. another dimension) can be added, transforming the feature space into a higher-dimensional
space in which the points become separable. These features do not need to be applied manually:
the SVM uses the kernel trick, which implicitly maps the non-linearly separable problem into a
higher-dimensional space and separates the feature points in the best possible way.
The SVM can be implemented efficiently using Python (more specifically, the scikit-learn library)
or the R language; a minimal example is sketched below.
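
A minimal scikit-learn sketch of a soft-margin SVM with the RBF kernel trick, trained on a
synthetic two-class toy problem (not the music data):

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # non-linearly separable classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # C trades margin width against outliers
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors used:", len(clf.support_))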

Pros:

1. Works well when there is a clear margin of separation.
2. Very effective in high-dimensional spaces.
3. It uses only a subset of the training points (the support vectors) in the decision function,
so it is memory-efficient.

Cons:
1. Inefficient for large data sets, since the training time becomes fairly large.
2. Efficiency reduces in the presence of noise, i.e. when the target classes overlap.
3. Probability estimates require an expensive five-fold or ten-fold cross-validation.

Applied approach for the MER:

Here, the acoustic features are extracted and applied directly to a two-level support vector
classifier. The key idea behind using two classifiers in place of a single one is that splitting the
classification into stages produces better results, hence increasing the fidelity of the system.

The first classifier is deployed to identify the genre of the song under test, while the second
classifier then deduces its inherent emotion.

Music genre acts as an important and straightforward factor when recognising emotions, because in
some genres a certain emotion is dominant over the others. So features deduced from the music genre
beforehand can help build a clearer picture of the emotion being conveyed.

That is how this two-level support vector classifier works: a music genre classifier is implemented
as a pre-processor, followed by a music emotion recogniser. This justifies the use of genre-related
acoustic features in addition to the emotion-related features to improve the prediction accuracy of
the system. (A sketch of the two-level scheme follows.)
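
A sketch of the two-level scheme, assuming X_train holds the 35-dimensional feature vectors and
y_genre_train / y_emotion_train hold the genre and emotion labels (these array names are
illustrative, not from the report):

from sklearn.svm import SVC

genres = ["pop", "rock", "jazz", "blues"]

# Level 1: genre classifier trained on all songs.
genre_clf = SVC(kernel="rbf").fit(X_train, y_genre_train)

# Level 2: one emotion classifier per genre, trained only on that genre's songs.
emotion_clf = {}
for g in genres:
    mask = (y_genre_train == g)
    emotion_clf[g] = SVC(kernel="rbf").fit(X_train[mask], y_emotion_train[mask])

def predict_emotion(x):
    g = genre_clf.predict(x.reshape(1, -1))[0]           # first level: predict the genre
    return emotion_clf[g].predict(x.reshape(1, -1))[0]   # second level: emotion within that genre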

MUSIC FEATURES:

Arousal is determined by timbre, pitch, loudness and tempo, while mode and harmony are used to
determine valence. The feature extraction is based mainly on four categories of music elements,
namely Rhythm, Tonality, Timbre and Dynamics.

These elements are laid out as a 35-dimensional feature vector.

Feature Description:

Fig. A: the weight (significance) of each feature in determining a particular genre.
Fig. B: the weight of each feature in determining the emotion of the music, i.e. the arousal and
valence values of the sampled music.
Fig. C: the result of the MER for the various music genres.

A. RHYTHM:
Rhythm is the time-variant beat pattern.
#1: Event Density: the average frequency of sudden bursts of energy.

#2: Tempo: represents the timing of sound events, expressed as beats (quarter-note periods) per
minute, and serves as a global rhythmic feature. Fast, high-tempo music tends to be exciting to
listen to, while slow-paced, low-tempo music is generally calming or dull. Regular beats put the
listener at ease, while irregular ones make the listener anxious.

#3: Pulse Clarity: the ease of perceiving the underlying rhythm of the music.

B. TIMBRE:
#4 (Zero-Crossing Rate) and #5-#17 (the first 13 MFCCs) are used to form the feature vector.
#18: Spectral Roll-off: indicates the amount of high-frequency content in the signal.
#19: Brightness: for the long-term average spectrum, the percentage of energy above a particular
cut-off frequency.

C. TONALITY:
#21-#32 form the chromagram, which represents the frequency content computed from a
logarithmically scaled spectrogram; these 12 chroma bins correspond to the 12 pitch classes (#33)
and describe the energy distribution among them.
Inharmonicity relates to the degree to which overtones deviate from multiples of the fundamental
frequency.

D. DYNAMICS:
This category captures the vitality of the sampled music. It is obtained by computing the RMS of
the signal, which provides an estimate of the energy feature corresponding to the loudness of a
syllable.
EXPERIMENTAL RESULTS:
301 songs were randomly selected from the four chosen genres: POP, ROCK, JAZZ and BLUES. These
genres were chosen because subjects agreed most readily about the emotion they convey. Each sample
was normalized before being passed to the SVM, and the performance evaluation was done by ten-fold
cross-validation.

Figure C shows the count of correct estimations of the inherent emotion of songs from a particular
genre. Arousal values have been intentionally used here because they carry less ambiguity than
valence values when determining a sample's inherent emotion.

ROCK and BLUES correspond to the most emotionally expressive samples and hence have the highest
accuracy.

CONCLUSION:

It has been observed that two-level support vector classification is more accurate than a
single-level SVC. The relation between music emotion and genre was also observed.

References:

1. N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: A survey", IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
2. D. Liu, L. Lu, H. J. Zhang, "Automatic mood detection from acoustic data", 4th Intl. Conf. on Music Information Retrieval, pp. 81-87, Oct. 2003.
3. P. Lucas, E. Astudillo, E. Peláez, "Human-machine musical composition in real-time based on emotions through a fuzzy logic approach", IEEE Latin America Congress on Computational Intelligence, pp. 1-6, Oct. 2015.
4. R. E. Thayer, The Biopsychology of Mood and Arousal, Oxford University Press, New York, 1989.
5. K. C. Dewi, A. Harjoko, "Kid's song classification based on mood parameters using K-nearest neighbor classification method and self organizing map", Intl. Conf. on Distributed Framework and Applications, pp. 1-5, 2010.
6. B. J. Han, S. M. Rho, R. B. Dannenberg, E. J. Hwang, "SMERS: Music emotion recognition using support vector regression", Proc. of Intl. Society for Music Information Retrieval, pp. 651-656, 2009.
7. K. Markov, T. Matsui, "Music genre and emotion recognition using Gaussian processes", IEEE Access, vol. 2, pp. 688-697, Jun. 2014.
8. C. Y. Chang, C. Y. Lo, C. J. Wang, P. C. Chung, "A music recommendation system with consideration of personal emotion", Intl. Computer Symposium, pp. 18-23, Dec. 2010.
9. M. Robnik-Šikonja, I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF", Machine Learning, vol. 53, no. 1, pp. 23-69, Oct. 2003.
10. C. C. Chang, C. J. Lin, LIBSVM: A Library for Support Vector Machines, [online] Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
11. A. Gabrielsson, E. Lindstrom, "The influence of musical structure on emotional expression", in Music and Emotion: Theory and Research, Oxford University Press, New York, pp. 223-248, 2001.
12. O. Lartillot, P. Toiviainen, "MIR in Matlab (II): A toolbox for musical feature extraction from audio", 8th Intl. Conf. on Music Information Retrieval, pp. 127-130, 2007.
13. L. Mion, G. D. Poli, "Score-independent audio features for description of music expression", IEEE Trans. Audio, Speech and Language Processing, vol. 16, no. 2, pp. 458-466, Jan. 2008.
14. M. A. Bartsch, G. H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations", IEEE Trans. on Multimedia, vol. 7, no. 1, pp. 96-104, Feb. 2005.
15. Y. H. Yang, H. H. Chen, Music Emotion Recognition, CRC Press, 2011.
