
Journal of Information & Computational Science 9: 1 (2012) 35-44

Available at http://www.joics.com

Music Sentiment Classification Integrating Audio with Lyrics

Jiang Zhong a,b,*, Yifeng Cheng a, Siyuan Yang a, Luosheng Wen a

a College of Computer Science, Chongqing University, Chongqing 400044, China
b Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing 400044, China

Abstract
Because a term can be associated with different classes to different degrees, an improved difference-based CHI approach is proposed to extract discriminative affective words from lyric text. A Support Vector Machine (SVM) classifier is then trained on the selected features; with the same number of features as conventional CHI, it clearly improves lyric sentiment classification performance. Serial feature fusion, which combines the lyric features selected by the improved method with audio features, is applied to benefit the classification task. A hierarchical mood detection framework using the fused feature sets is put forward. The experimental results verify the effectiveness of this fusion.
Keywords: Music Sentiment Classification; CHI Approach; Support Vector Machine (SVM); Lyric

1 Introduction

With the growing amount of digital music and the varied demands people place on music information retrieval, automatic music sentiment analysis is becoming an essential task for systems and applications such as music organization, song selection on mobile devices [1] and music recommendation [2]. The task is to automatically label a song with affective labels drawn from an emotion set specified by psychologists. In the last few years it has attracted increasing attention, and a wide range of related research has been carried out.
To date, two sources have commonly been used to analyze music emotion. At first, only audio mood analysis was employed to detect a song's emotion. Later work explored a second source: the sentiment of a song's lyrics, which carry relevant affective information that is not present in the audio. However, each of these is a single source, and the resulting performance is not good enough. Motivated by the limits of single sources, co-training algorithms that combine multiple different sources have been proposed.

Project supported by the Natural Science Foundation Project of CQ CSTC (2010BB2046, 2009BB2184), the Third Stage Building of 211 Project (Grant No. S-10218), the National Basic Research Program of China (Grant No. 2011CB302600) and the Fundamental Research Funds for the Central Universities (Project No. CDJZR10 18 00 25).
* Corresponding author.
Email address: zhongjiang@cqu.edu.cn (Jiang Zhong).

1548-7741 / Copyright 2012 Binary Information Press
January 2012

To make full use of these two sources, this paper introduces a data fusion approach, specifically feature fusion. In recent years, data fusion has developed rapidly and has been applied widely in many pattern recognition areas [3] [4] [5] [6]. Feature fusion, one kind of data fusion, plays a very important role in it [7]. Feature fusion methods select and combine features, exploiting the correlation between the distinct feature sets involved to remove redundant and irrelevant features [8]. They can derive the most effective, lowest-dimensional feature vector sets to support the final decision [7]. The different feature sets are therefore fused into a better feature set, which is the input to the classifier that produces the final results.
There are two popular feature combination strategies in feature fusion: serial fusion, based on serial feature combination, and parallel fusion, based on parallel feature combination.
The state-of-the-art supervised approaches to sentiment analysis are machine learning techniques. Given the sentiment features, a sentiment classifier is trained and then used to classify unlabeled instances. However, the labeling process is time- and energy-consuming.
In the lyric sentiment analysis task, unsupervised methods are also used. These are based on a sentiment dictionary such as HowNet or on syntactic rules. Their disadvantage is that they depend heavily on the quality of the sentiment dictionary or of the manually developed rules.
The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes the audio and lyric text feature sets used in this paper. Section 4 introduces the hierarchical mood detection framework. Section 5 gives the experiment setup, including the data set and evaluation measures, together with the experimental results and discussion. Section 6 draws conclusions and outlines future work.

2 Related Work

2.1 Music Mood Taxonomy

There are two types of model of how humans perceive emotion: the categorical model and the dimensional model. The first consists of a number of separate basic moods, such as happy, sad, anger, horror and so on [9] [10] [11]. Different combinations of these basic moods form all human emotions. However, transitions between moods are abrupt in this model, since the fundamental moods are discontinuous. The second is the dimensional model. James Russell defines two dimensions, valence (negative-positive) and arousal (high-low) [11], which correspond to Thayer's, and this formulation is now widely used. In the 1990s, Thayer [12] put forward a two-dimensional model that represents moods as points in a plane spanned by Stress (happy/anxious) and Energy (calm/energetic). In this model all moods are continuous, smooth and gradual transitions among diverse moods can be represented, and the similarity or difference between moods can be measured by their distance in the space.
However, there is no absolute standard model of human mood. In this paper, we apply Thayer's model and divide moods into four clusters: contentment, depression, exuberance and anxious/frantic (Fig. 1). The reason is reviewed in detail in [14].

Fig. 1: Thayer's mood model (axes: energy and stress; quadrants: exuberance, anxious/frantic, contentment, depression)

2.2 Sentiment Classification

2.2.1 Classification Using Single Sources

So far, most research has focused on audio sentiment information. [15] casts emotion detection as a multi-label classification problem and addresses it using multi-label classification based on audio. [16] first extracts audio features and then develops a fuzzy approach to determine how likely a song segment is to belong to an emotion class. [17] uses audio sentiment features, including timbre, intensity and rhythm features (rhythm strength, rhythm regularity and tempo) [18] [19], and presents a hierarchical framework to automate mood detection from acoustic music data. Yang and Lee [19] detect music mood from lyrics using software agents, with promising results. [20] applies the sentiment Vector Space Model (s-VSM) to represent song lyric documents and shows that this model is better than the Vector Space Model (VSM) for sentiment classification.
2.2.2 Classification Using Multiple Sources

[21] combines audio features with the best lyric features, extracted by the authors' own feature selection approach, to investigate a song's mood. Their experiments show that this approach outperforms either single source. [22] uses three modalities (audio, lyrics and MIDI) to develop three variants of the standard co-training algorithm, and the experimental results verify the good performance of these methods.
In addition, [23] exploits both audio features and collaborative user annotations, fusing them to improve overall performance.

3 Music Sentiment Feature Extraction

3.1 Acoustic Feature

Currently, audio sentiment characteristics include prosody and timbre features. Prosody mainly covers pitch frequency (F0), amplitude (or energy), pronunciation duration and so on. Timbre comprises spectral shape and spectral contrast features, such as formants, spectral energy distribution, harmonic-to-noise ratio and so forth. In this paper, the following features are extracted: Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), spectral centroid, spectral roll-off, spectral flux, short-time zero-crossing rate, music tempo, short-time energy and so on.
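As an illustration of two of the simpler features listed above, the following Python sketch computes the short-time energy and the zero-crossing rate of one frame. The exact definitions used in the paper are not given, so the normalizations here (mean squared amplitude, and the fraction of adjacent sample pairs that change sign) are assumptions, as are the function names.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame (assumed normalization)."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# A toy frame that alternates sign on every sample.
frame = [0.5, -0.5, 0.5, -0.5]
print(short_time_energy(frame))   # 0.25
print(zero_crossing_rate(frame))  # 1.0
```

In practice these values would be computed per windowed frame and then summarized (for example by mean and variance) over a segment.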

3.2 Lyric Text Feature

Previous work shows that lyrics are, to some extent, complementary to audio. Lyrics are text that carries sentiment, so lyric text helps song sentiment classification. An intuitive approach is to treat text sentiment classification as a text categorization task. This paper therefore extracts term frequencies from every song's lyrics as text sentiment features and trains a Support Vector Machine (SVM) classifier to perform sentiment analysis.
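The term-frequency representation described above can be sketched as follows. This is a minimal illustration that assumes the lyrics have already been split into terms (Mandarin lyrics would first need word segmentation, which is not shown); the vocabulary and tokens are hypothetical English stand-ins.

```python
from collections import Counter


def term_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one lyric."""
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occur in the lyric.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and pre-tokenized lyric.
vocabulary = ["love", "tears", "dance"]
lyric_tokens = ["love", "love", "tears", "night"]
print(term_frequency_vector(lyric_tokens, vocabulary))  # [2, 1, 0]
```

Each song then becomes one fixed-length vector over the selected vocabulary, which is the input a text classifier such as an SVM expects.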
3.2.1 CHI Feature Selection Approach

The CHI approach is one of the common feature selection methods in text categorization. It measures the degree of correlation between a term $w$ and a category $c_i$, assuming that the statistic follows a chi-square ($\chi^2$) distribution with one degree of freedom. The larger the CHI statistic of term $w$ for category $c_i$, the stronger the association between them. Equation (1) defines the CHI value of term $w$ for category $c_i$.

$$\chi^2(w, c_i) = \frac{N (AD - BC)^2}{(A + B)(A + C)(B + D)(C + D)} \qquad (1)$$

In Equation (1), $A$ is the number of documents that belong to category $c_i$ and contain term $w$; $B$ is the number of documents that belong to $c_i$ but do not contain $w$; $C$ is the number of documents that do not belong to $c_i$ but contain $w$; $D$ is the number of documents that neither belong to $c_i$ nor contain $w$; and $N$ is the total number of documents in the training set.
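A small Python sketch of the CHI statistic may help make the four document counts concrete. It uses the standard chi-square statistic for a 2x2 contingency table; the function name, the zero-denominator convention and the example counts are assumptions for illustration only.

```python
def chi_square(a, b, c, d):
    """CHI value of a term for a category from the four document counts.

    a: documents in the category that contain the term
    b: documents in the category that do not contain the term
    c: documents outside the category that contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # Assumed convention: an empty row or column carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts for one term and one category (N = 100).
print(chi_square(30, 10, 10, 50))
# A term spread evenly over both categories scores zero.
print(chi_square(10, 10, 10, 10))  # 0.0
```

The score grows when the term is concentrated inside (or outside) the category and vanishes when term and category are independent.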
This paper adopts the global strategy, which takes the maximum CHI value of a term over the categories $c_i$ as the term's CHI value for the whole corpus:

$$\chi^2(w) = \max_{c_i \in C} \chi^2(w, c_i) \qquad (2)$$

When selecting discriminative terms, the approach first selects the terms with the largest CHI values.
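The global strategy can be sketched in a few lines of Python: each term keeps the maximum of its per-category CHI values, and the top-ranked terms are selected. The function name, the tie-breaking rule and the example scores are hypothetical.

```python
def select_top_terms(chi_by_term, k):
    """Rank terms by their maximum per-category CHI value and keep k."""
    global_chi = {term: max(values) for term, values in chi_by_term.items()}
    # Sort by descending score; ties are broken alphabetically (assumption).
    ranked = sorted(global_chi, key=lambda term: (-global_chi[term], term))
    return ranked[:k]


# Hypothetical per-category CHI values for three terms and two categories.
chi_by_term = {
    "tears": [12.0, 3.5],
    "dance": [1.2, 9.8],
    "the": [0.4, 0.3],
}
print(select_top_terms(chi_by_term, 2))  # ['tears', 'dance']
```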
3.2.2 Improved Difference-Based CHI (DIFCHI)

Given two categories, the conventional CHI approach selects discriminative terms by the average or the maximum CHI value of a term over the categories. The different degrees of association between a term and the categories are thus taken into account, but the relative difference between these associations is ignored. For some terms, the CHI values for the different categories are all large while the relative difference between them is small. Such terms discriminate less well than terms that have both large CHI values and a large relative difference between them, yet feature selection aims to extract precisely the terms that distinguish categories well. The following example illustrates the situation. Consider two terms $w_1$, $w_2$ and two categories $C_1$, $C_2$; Table 1 shows their CHI values.


Table 1: CHI values of terms w1 and w2 for categories C1 and C2

Term    C1      C2
w1      0.60    0.65
w2      0.40    0.55

To classify $C_1$ and $C_2$, common practice under the usual global strategy is to select term $w_1$. From Table 1, the CHI values of $w_1$ for categories $C_1$ and $C_2$ (0.60 and 0.65, respectively) are large and their difference is small (0.05). Meanwhile, the CHI values of $w_2$ for $C_1$ and $C_2$ (0.40 and 0.55, respectively) are also large, and their difference is large as well (0.15). Which term is more discriminative for classifying documents in $C_1$ and $C_2$ that contain both terms? It should be $w_2$, because the difference between its two CHI values, 0.15, is much larger than that of $w_1$, 0.05. We therefore hypothesize that terms with both large CHI values and a large difference between those values are the most discriminative. Based on this analysis, we propose the difference-based CHI feature selection approach.
The difference is defined as the difference between the associations of a term with two distinct categories, as given in Equation (3):

$$\mathrm{dif}(w) = \frac{\lvert \mathrm{chi}_i - \mathrm{chi}_j \rvert}{\max(\mathrm{chi}_i, \mathrm{chi}_j)} \qquad (3)$$

In Equation (3), $\mathrm{chi}_i$ and $\mathrm{chi}_j$ are the CHI values of term $w$ for categories $C_i$ and $C_j$. When discriminative terms are extracted from the candidates, terms with a large $\mathrm{dif}(w)$ are preferred.
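As a worked example of Equation (3), the sketch below evaluates dif(w) for the two terms discussed above, using the values quoted in the text (0.60 and 0.65 for w1, 0.40 and 0.55 for w2). It only computes the relative difference; how dif(w) is combined with the CHI magnitude when ranking candidates is not specified here, so no ranking rule is implemented. The function name and the zero-value convention are assumptions.

```python
def dif(chi_i, chi_j):
    """Relative difference between a term's CHI values for two categories."""
    largest = max(chi_i, chi_j)
    if largest == 0:
        # Assumed convention: no association with either category.
        return 0.0
    return abs(chi_i - chi_j) / largest


print(round(dif(0.60, 0.65), 3))  # w1: 0.077
print(round(dif(0.40, 0.55), 3))  # w2: 0.273
```

The relative difference of w2 is roughly 3.5 times that of w1, which matches the argument that w2 is the more discriminative term.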

4 Hierarchical Mood Detection Framework

D. Liu et al. [18] present a hierarchical, audio-based framework that automates mood detection from acoustic data. The advantages of the hierarchical framework are discussed in [18]. The framework is illustrated in Fig. 2.
As Fig. 2 shows, only audio data is employed in the whole process. To take advantage of lyric data, we refine the framework by importing lyric data. The modified framework is shown in Fig. 3.
Comparing the two diagrams, the evident difference lies in the feature sets used between the groups and the explicit clusters. In Fig. 2, Feature B and Feature C are subsets of the audio feature set, whereas Fused Feature B and Fused Feature C in Fig. 3 are fused feature sets of audio and lyrics. The classification process of the refined hierarchical framework is as follows. First, songs are divided into two groups based on audio data, mainly the acoustic energy. Then the fused feature sets are applied to classify the songs in each group further into the four clusters.
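The two-stage process just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the three classifiers are passed in as plain callables (in the paper they would be trained SVMs), the toy threshold rules stand in for trained models, and the assignment of contentment and depression to Group 1 and of exuberance and anxious to Group 2 is read from the framework figures.

```python
def classify_mood(audio_features, lyric_features,
                  group_classifier, group1_classifier, group2_classifier):
    """Two-stage mood classification following the refined framework."""
    # Stage 1: audio features alone decide the group.
    group = group_classifier(audio_features)
    # Stage 2: serially fused (concatenated) features decide the cluster.
    fused = list(audio_features) + list(lyric_features)
    if group == 1:
        return group1_classifier(fused)
    return group2_classifier(fused)


# Toy stand-ins for trained classifiers (hypothetical threshold rules).
def toy_group_classifier(audio):
    # audio[0] is assumed to be an intensity value scaled to [0, 1].
    return 1 if audio[0] < 0.5 else 2


def toy_group1_classifier(fused):
    return "contentment" if fused[-1] > 0 else "depression"


def toy_group2_classifier(fused):
    return "exuberance" if fused[-1] > 0 else "anxious"


classifiers = (toy_group_classifier, toy_group1_classifier, toy_group2_classifier)
print(classify_mood([0.2], [1.0], *classifiers))   # contentment
print(classify_mood([0.9], [-1.0], *classifiers))  # anxious
```

Keeping the classifiers as parameters makes it easy to swap the audio-only second stage of the original framework for the fused-feature second stage without changing the control flow.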

Fig. 2: Hierarchical framework (a music clip is split by intensity into Group 1 for little intensity and Group 2 for big intensity; audio Feature B separates Group 1 into Contentment and Depression, and audio Feature C separates Group 2 into Exuberance and Anxious)

Fig. 3: Refined hierarchical framework (a music clip is split by audio into Group 1 and Group 2; Fused Feature B separates Group 1 into Contentment and Depression, and Fused Feature C separates Group 2 into Exuberance and Anxious)

5 Experiment Setup

The task of music sentiment classification amounts to training a classifier. For the experiments, a data set is collected that includes wave files and the corresponding lyric text. An SVM classifier is applied because of its good overall performance. First, we use only audio features. Then, lyric features are employed to test the DIFCHI hypothesis. Finally, audio features and lyric features are combined to verify whether this fusion approach can improve classification accuracy. The experimental results are obtained by repeated cross-validation.

5.1 Data Set Collection

Few standard, publicly available data sets of Mandarin Chinese songs exist, so we first construct one. 500 Mandarin songs, including audio and lyrics, are downloaded from Google's music mood classification. Although every song has its own mood tag, we still confirm the tags manually. First, we check whether the song also appears in Baidu's music mood classification and whether the two mood tags agree. If so, the song is added to the candidate list; other cases are ignored. Then three people who are familiar with music label mood tags for the songs in the candidate list. Throughout this process they work independently, knowing nothing about the others' labels. After this step, the 400 songs with consistent tags are retained as the corpus. Each song's lyrics are preprocessed, mainly to eliminate text that has nothing to do with the song. In audio sentiment classification, short annotated audio segments are usually used as the training set. Before the experiments, all songs are divided into 30-second segments, which are then converted into 16 kHz, 16-bit mono wave audio. A Hamming window with a 32 ms window length and a half-frame offset is applied to the segments.
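The framing parameters above translate into concrete sample counts, which the following Python sketch makes explicit. It assumes the common Hamming window definition 0.54 - 0.46 cos(2 pi n / (N - 1)) and drops any incomplete trailing frame; both choices, and all names, are assumptions rather than details taken from the paper.

```python
import math

SAMPLE_RATE = 16_000                        # 16 kHz mono audio
FRAME_LENGTH = SAMPLE_RATE * 32 // 1000     # 32 ms window -> 512 samples
HOP_LENGTH = FRAME_LENGTH // 2              # half-frame offset -> 256 samples


def hamming_window(length):
    """Hamming window coefficients (common 0.54/0.46 definition)."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_count(n_samples, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH):
    """Number of complete frames that fit into n_samples."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // hop_length


def windowed_frames(samples, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH):
    """Split samples into overlapping frames and apply the window."""
    window = hamming_window(frame_length)
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


print(FRAME_LENGTH, HOP_LENGTH)          # 512 256
print(frame_count(30 * SAMPLE_RATE))     # 1874 frames per 30-second segment
print(len(windowed_frames([1.0] * 1024)))  # 3
```

Under these assumptions each 30-second segment yields 1874 overlapping frames, from which frame-level features can be computed.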

5.2 Evaluation Criterion

Commonly used classification evaluation criteria are precision (P), recall (R), F1 and so on. They are defined as follows.

$$P = \frac{\text{number of documents classified correctly in a category}}{\text{number of documents identified as the category}} \times 100\% \qquad (4)$$

$$R = \frac{\text{number of documents classified correctly in a category}}{\text{total number of documents of the category in the test set}} \times 100\% \qquad (5)$$

$$F1 = \frac{2PR}{P + R} \times 100\% \qquad (6)$$

In our experiments, we adopt precision (P) and F1 to test the performance of the classification system.
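The three measures can be computed directly from paired true and predicted labels, as in the Python sketch below. It returns fractions in [0, 1] (multiply by 100 for percentages) and treats an empty denominator as zero; the function name, that convention and the example labels are assumptions.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Per-category precision, recall and F1 as fractions in [0, 1]."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == category and p == category)
    identified = sum(1 for _, p in pairs if p == category)
    actual = sum(1 for t, _ in pairs if t == category)
    precision = correct / identified if identified else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for five test songs.
true_labels = ["happy", "happy", "sad", "sad", "happy"]
predicted_labels = ["happy", "sad", "sad", "happy", "happy"]
p, r, f1 = precision_recall_f1(true_labels, predicted_labels, "happy")
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.667 R=0.667 F1=0.667
```

Here two of the three songs identified as "happy" are correct, and two of the three truly "happy" songs are found, so precision, recall and F1 all equal 2/3.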

5.3 Results and Discussion

5.3.1 Classification Based on Audio

We obtained the results using libsvm in Weka. Table 2 shows the classification results based on audio features alone.

Table 2: Classification precision and F1 using audio features alone

Feature Type    Precision (%)    F1 (%)
Audio           73.00            72.30

Table 2 shows that the performance obtained is not satisfactory. Precision and F1 are both only a little above 70%, and F1 is lower than precision. We therefore argue that audio-based sentiment classification alone is not a satisfactory approach.
5.3.2 Classification Based on Lyric

In terms of precision and F1, we compare the two CHI approaches: conventional CHI and our proposed improved difference-based CHI approach (DIFCHI). We perform 10 runs of 10-fold cross-validation to obtain the results. Fig. 4 and Fig. 5 show the precision and F1 of the two methods, respectively.
As can be seen from Fig. 4, the average classification precision of the conventional CHI approach is about 81.2%. As the feature dimension increases, precision does not improve and even decreases slightly at 2000 feature dimensions. In contrast, the minimum precision of the DIFCHI method is 85%, about 4 percentage points higher than CHI, and precision rises as more features are used. This confirms the DIFCHI hypothesis and indicates that DIFCHI extracts more discriminative terms and clearly improves precision at the same feature dimension as CHI.

Fig. 4: CHI and DIFCHI classification precision contrast (x-axis: feature dimension, 1500 to 3000; y-axis: precision)

Fig. 5: CHI and DIFCHI classification F1 contrast (x-axis: feature dimension, 1500 to 3000; y-axis: F1)

As Fig. 5 shows, the F1 value of CHI is about 81% and does not improve over the whole range; it even drops to 79.2% at 2000 feature dimensions. By contrast, the minimum F1 of DIFCHI is about 85%, and F1 rises as the feature dimension increases. The precision and F1 results are consistent with each other, which indicates that DIFCHI is effective in the feature selection phase.
5.3.3 Classification Integrating Audio and Lyric

Fusion methods can be used to flexibly integrate heterogeneous data sources to improve classification performance. There are two popular methods for fusing audio and lyrics in music sentiment classification: feature fusion and classifier fusion. The former extracts audio features and lyric features separately, combines the different types of feature vectors into one single vector, and constructs only one final classifier on the combined feature set. The latter constructs two independent classifiers, an audio-based classifier and a lyric-based classifier, and then decides the final result from the two classifiers' outputs according to a certain strategy, such as voting.
The introduction mentioned two popular feature combination approaches. Serial combination simply concatenates the various feature sets. In parallel combination, two different feature sets are combined into a complex feature set; feature selection and transformation techniques such as Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Linear Discriminant Analysis (LDA) are then introduced to reduce the dimensionality. The serial fusion strategy is adopted in this paper: we concatenate the audio and lyric feature sets into one feature vector space and run the classification algorithm on the fused features. The results are shown in Table 3.
In Table 3, A denotes audio features, L(CHI) denotes lyric features extracted by the CHI approach, and L(DIFCHI) denotes lyric features selected by the DIFCHI method. The audio results are the baseline.

Table 3: Classification precision and F1 using fused audio and lyric features

Feature Type    Precision (%)    F1 (%)    F1 Improvement (%)
A               73.00            72.30     0.00
A+L(CHI)        85.25            86.46     14.16
A+L(DIFCHI)     87.41            88.75     16.45

As can be seen from Table 3, the F1 improvements of the two fusion approaches over the audio baseline are 14.16 and 16.45 percentage points, respectively, and A+L(DIFCHI) obtains the best result. Compared with the single-source results above, the fusion approach achieves better performance than either audio or lyrics alone.
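Serial fusion as used here is plain concatenation, which the following Python sketch illustrates; it also recomputes the F1 improvement column of Table 3 as a difference in percentage points from the audio baseline. The feature values and function names are hypothetical, and any scaling of the two feature sets before concatenation is not covered.

```python
def serial_fusion(audio_vector, lyric_vector):
    """Concatenate audio and lyric features into one fused vector."""
    return list(audio_vector) + list(lyric_vector)


def f1_improvement(fused_f1, baseline_f1):
    """Improvement over the baseline, in percentage points."""
    return round(fused_f1 - baseline_f1, 2)


# Hypothetical feature vectors for one song.
audio_vector = [0.12, 0.80, 0.33]   # e.g. three acoustic features
lyric_vector = [2, 0, 1, 0]         # e.g. four term frequencies
fused = serial_fusion(audio_vector, lyric_vector)
print(len(fused))  # 7

# F1 values reported in Table 3, with audio alone (72.30) as the baseline.
print(f1_improvement(86.46, 72.30))  # 14.16
print(f1_improvement(88.75, 72.30))  # 16.45
```

The fused vector has the summed dimensionality of its parts, which is why serial fusion needs no further transformation step, unlike the parallel strategy.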

6 Conclusion and Future Work

The DIFCHI method proposed in this paper performs better than conventional CHI in the lyric sentiment classification task. Given the same feature dimension as CHI, it selects more discriminative terms and improves classification performance. The fusion of the two feature sets, audio and the best DIFCHI lyric features, obtains the highest precision and F1. We therefore conclude that a fusion approach using heterogeneous data sources helps to improve music sentiment classification.
In practice, moods fall into more categories than those considered in this paper, and we will study sentiment classification with more categories in the future. Parallel feature fusion and classifier fusion are also left for future work.

References

[1] M. Tolos, R. Tato, T. Kemp. Mood-based navigation through large collections of musical data. In 2nd IEEE Consumer Communications and Networking Conference (CCNC 2005), 2005, 71-75
[2] Rui Cai, Chao Zhang, Chong Wang, Lei Zhang, Wei-ying Ma. Musicsense: Contextual music recommendation using emotional allocation modeling. In MULTIMEDIA '07: Proceedings of the 15th international conference on Multimedia, 2007, 553-556
[3] H. C. Chiang, R. L. Moses, L. C. Potter. Model-based Bayesian feature matching with application to synthetic aperture radar target recognition, Pattern Recognition, 34(8), 2001, 1539-1553
[4] T. Peli, Mon Young, R. Knox et al. Feature level sensor fusion, Proceedings of the SPIE Sensor Fusion: Architectures, Algorithms and Applications III, 3719, 1999, 332-339
[5] N. Doi, A. Shintani, Y. Hayashi et al. A study on mouth shape features suitable for HMM speech recognition using fusion of visual and auditory information, IEICE Trans. Fundam. E78-A (11), 1995, 1548-1552
[6] A. H. Gunatilaka, B. A. Baertlein. Feature-level and decision-level fusion of noncoincidently sampled sensors for land mine detection, IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 2001, 577-589
[7] Jian Yang, Jing-yu Yang, David Zhang et al. Feature fusion: Parallel strategy vs. serial strategy. Pattern Recognition, 36, 2003, 1369-1381
[8] U. G. Mangai, S. Samanta, S. Das, P. R. Chowdhury. A survey of decision fusion and feature fusion strategies for pattern classification. IETE Tech. Rev., 27, 2010, 293-307
[9] P. Ekman, An argument for basic emotions. Cognition and Emotion, 6(3/4), 1992, 169-200
[10] K. Hevner. Experimental studies of the elements of expression in music. American Journal of Psychology, 48(2), 1936, 246-268
[11] J. A. Russell. Affective space is bipolar. Journal of Personality and Social Psychology, 37(3), 1979, 345-356
[12] R. E. Thayer. The Biopsychology of Mood and Arousal. Oxford University Press, 1989, USA
[13] Rong-chu Wei, R. T. Tsai, Ying-sian Wu et al. LAMP, A lyrics and audio mandopop dataset for music mood estimation: Dataset compilation, system construction, and testing. 2010 International Conference on Technologies and Applications of Artificial Intelligence, 2010. NW Washington: IEEE Computer Society, 2010, 53-59
[14] Tao Li, Mitsunori Ogihara. Detecting emotion in music. Proceedings of the International Symposium on Music Information Retrieval, Washington D.C., USA. 2003, 239-240
[15] Yi-hsuan Yang, Chia-chu Liu, H. H. Chen. Music emotion classification: A fuzzy approach. Proceedings of the 14th Annual ACM International Conference, 2006, 81-84
[16] Lie Lu, Dan Liu, Hong-jiang Zhang. Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 2006, 5-18
[17] P. N. Juslin. Cue utilization in communication of emotion in music performance: Relating performance to perception, J. Exper. Psychol.: Human Percept. Perf., 16(6), 2000, 1797-1813
[18] D. Liu, L. Lu, H. J. Zhang. Automatic mood detection from acoustic music data. Proceedings of the International Symposium on Music Information Retrieval (ISMIR '03), 2003
[19] D. Yang, W. Lee. Music emotion identification from lyrics. Proceedings of the 2009 11th IEEE International Symposium on Multimedia, 2009, 624-629
[20] Yunqing Xia, Linlin Wang, Kam-Fai Wong et al. Sentiment vector space model for lyric-based song sentiment classification. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, 2008, 133-136
[21] Xiao Hu, J. Stephen Downie. Improving mood classification in music digital libraries by combining lyrics and audio. 2010 Proceedings of the 10th annual joint conference on Digital libraries, 2010, 45-48
[22] Yongkai Zhao, Deshun Yang, Xiaoou Chen. Multi-modal music mood classification using co-training. International Conference of Computational Intelligence and Software Engineering (CiSE 2010), 2010, 1-4
[23] K. Bischoff, C. S. Firan, R. Paiu et al. Music mood and theme classification: A hybrid approach. 10th International Society for Music Information Retrieval Conference (ISMIR 2009), 2009, 657-662