Submitted By:
Shivansh Agarwal
2K16/CO/303
Shivansh Gupta
2K16/CO/304
Tushar Gautam
2K16/CO/333
DECLARATION
We hereby declare that the Minor Project (B. Tech Project-I CO401) work entitled “Analysing
vocal patterns for Speech Emotion Recognition using Deep Learning” which is being
submitted to Delhi Technological University in partial fulfillment of the requirements for the
award of the degree of Bachelor of Technology in the Department of Computer Science &
Engineering, is a bona fide report of the Minor Project carried out by us. The material contained
in this report has not been submitted to any University or Institution for the award of any degree.
Submitted By:
Shivansh Agarwal
2K16/CO/303
Shivansh Gupta
2K16/CO/304
Tushar Gautam
2K16/CO/333
CERTIFICATE
This is to certify that the Minor Project Report entitled “Analysing vocal patterns for Speech
Emotion Recognition using Deep Learning” is the work of Shivansh Agarwal (2K16/CO/303),
completed under my supervision and forms a part of the Bachelor of Technology course curriculum
in the Department of Computer Science & Engineering, Delhi Technological University, Delhi.
Dr. Aruna Bhat
Associate Professor
Dept. of Computer Science & Engineering
Delhi Technological University
ACKNOWLEDGEMENT
First of all, we would like to express our deep sense of respect and gratitude to our project
supervisor Dr. Aruna Bhat for providing the opportunity of carrying out this project and being
the guiding force behind this work. We are deeply indebted to her for the support, advice and
encouragement she provided without which the project could not have been a success.
Secondly, we are grateful to Dr. Rajni Jindal, HOD, Computer Science & Engineering
Department, DTU for her immense support. We would also like to acknowledge Delhi
Technological University library and staff for providing the right academic resources. Last but
not least, we would like to express our sincere gratitude to our parents and friends for
constantly encouraging us during the completion of this work.
Date:
Submitted By:
Shivansh Agarwal
2K16/CO/303
Shivansh Gupta
2K16/CO/304
Tushar Gautam
2K16/CO/333
B.Tech
Department of Computer
Science & Engineering
Delhi Technological University,
Delhi-110042
INDEX
1. Abstract
2. Introduction
3. General concept
4. Motivation
5. Related work
6. Problem statement
7. Tools used
8. Proposed Methodology
9. Dataset
10. Results
Abstract
This report presents a method for speech emotion recognition using spectrograms and deep
convolutional neural networks (CNNs) coupled with transfer learning. Spectrograms generated
from the audio signals are input to the deep CNN. The model consists of convolutional layers
and fully connected layers, which extract discriminative features from the spectrogram images
and output predictions for the fourteen categories of emotions. In this study, we trained the
proposed model on spectrograms obtained from the Toronto Emotional Speech Set (TESS).
Furthermore, we also used a voting method to make the results more robust and reliable.
Analysing vocal patterns for Speech Emotion Recognition using Deep
Learning
Introduction
An Overview:
In this report, we present our approach: we first convert speech audio signals to spectrograms
and then use very deep convolutional neural networks to extract high-level features, followed by
fully connected layers. We take a maximal ensemble (majority vote) over several such
architectures used for feature extraction, and thereby perform the classification into the
different emotion classes.
General Concept
Emotion recognition is the process of identifying human emotion, most typically from facial
expressions as well as from verbal expressions. Humans do this automatically, but computational
methodologies have also been developed.
Although often used as synonyms, “sentiment” and “emotion” do not express the same thing. The
dictionary defines a “sentiment” as an opinion or view, whereas the term “emotion” refers to
“a strong feeling deriving from one’s mood”.
When talking about sentiment and emotional analysis, we are therefore dealing with two distinct
methods of evaluating people’s moods. Both aim to better understand the readers and give
insights about their emotional responses.
Sentiment Analysis:
Sentiment analysis aims to capture the general feel or impression people get from consuming a
piece of content. It does not focus on specific, articulated emotions.
It instead relies on a simplified binary system of “positive” and “negative” responses: we only
seek to know whether the reader had a positive or negative experience with the content.
It is a simplified analysis method that produces insights that are easy to process and quantify.
This method has proven its efficiency in bringing valuable insights about both audiences and
content. On one hand, it helps businesses put a finger on the preferences and inclinations of
readers. If you get 60% negative feedback on a blog post, you know that your audience is not
particularly attracted to that article.
Emotional Analysis
Contrary to sentiment analysis, emotional analysis relies on a more sophisticated and
complex system.
While the former uses a simplified binary categorization, the latter relies on a deeper analysis
of human emotions and sensitivities. This method highlights the nuances between the different
feelings readers express. It is a more meticulous, thorough look into the degrees and intensities
associated with each emotion.
Unlike sentiment analysis, emotional analysis is inclusive and considerate of the different
variations of human mental subjectivity. It is usually based on a wide spectrum of moods rather
than a couple of static categories. Within the positive range, it detects specific emotions such
as happiness, satisfaction, or excitement, depending on how it is configured.
What are the various features that could be extracted from speech?
Various types of features have been proposed, such as mel-frequency cepstral coefficients
(MFCCs), prosodic features, linear predictive cepstral coefficients (LPCCs), and perceptual
linear predictive coefficients (PLPs), to achieve better results in human emotion recognition.
Recently, there has also been interest in alternative features extracted from an
auditory-inspired long-term spectro-temporal representation, obtained by utilizing a modulation
filterbank and an auditory filterbank for speech decomposition. El Ayadi et al. grouped speech
features into four groups: continuous, qualitative, spectral, and TEO (Teager energy
operator)-based features. Although selecting these features can achieve good performance, there
are still limitations that have forced researchers to build their own feature sets, because most
of the standard features are based on short-term analysis while the speech signal is
non-stationary.
In our approach we use the continuous speech features (pitch) for the classification of
emotions, by building spectrograms of the speech samples for each emotion.
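To make the feature types above concrete, the following is a minimal sketch of extracting MFCCs
and a coarse per-frame pitch estimate from one speech clip. It assumes the librosa library
(which is not listed under Tools Used in this report) and a placeholder file name audio.wav.

```python
# Sketch: MFCC and pitch extraction for one clip; librosa and the file path are assumptions.
import numpy as np
import librosa

y, sr = librosa.load("audio.wav", sr=None)            # waveform and sampling rate

# 13 mel-frequency cepstral coefficients per frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)

# Coarse per-frame pitch: keep the strongest pitch candidate in each frame
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
pitch_per_frame = pitches[magnitudes.argmax(axis=0), np.arange(pitches.shape[1])]

print("Mean MFCCs:", mfcc.mean(axis=1))
print("Mean pitch (Hz, voiced frames):", pitch_per_frame[pitch_per_frame > 0].mean())
```

In our pipeline such feature vectors are not fed to the classifier directly; the pitch
information is instead captured implicitly by the spectrogram images described below.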
Figures: spectrograms for the sad and happy emotions.
The difference between the two emotions is clearly visible in the spectrograms: the pitch of the
waves for the happy emotion is higher than the pitch of the waves for the sad emotion.
A difference is also observed between spectrograms with respect to the age of the speaker:
Figures: spectrograms for the sad emotion of an old speaker and of a young speaker.
The difference between the speakers' ages is clearly visible in the spectrograms: the pitch for
a young speaker is higher than the pitch for an old speaker.
Motivation
● Human machine interfaces are commonly used nowadays in many applications. Most
of them require the detection of emotion in speech, but very few human machine
interfaces currently in use are able to achieve that.
● Emotion recognition can play a vital role in building state-of-the-art artificially
intelligent systems.
● With the help of speech emotion recognition, machines will have the ability to hold more
human-like conversations.
Related Work
LITERATURE SURVEY ON CLASSIFIERS USED IN EMOTION RECOGNITION USING SPEECH

1. Yuanlu Kuang and Lijuan Li et al. (2013) [12]
   Classifiers: HMM, ANN
   Proposed a Dempster–Shafer evidence theory based decision fusion technique between the two
   classifiers to classify emotions such as anger, sadness, surprise and disgust. It produced an
   83.86% recognition rate.

2. Bjorn Schuller et al. (2004) [20]
   Classifiers: GMM
   The system produced an 86% recognition rate, whereas human judges achieved a 79.8% recognition
   rate on the same corpus.

3. Wen-Yi Huang, Tsang-Long Pao (2012) [9]
   Classifiers: KNN, HMM, GMM, SVM
   Proposed the inclusion of keywords as a feature rather than only the speech signal. The final
   result is obtained by a fusion technique.

4. Amiya Kumar et al. (2015) [19]
   Classifiers: SVM
   A multilevel SVM is used to identify seven emotions, and the observed recognition rate is
   82.26%.

5. Chung-Hsien Wu et al. (2011) [23]
   Classifiers: HMM, SVM, MLP
   Proposed a meta decision tree (MDT) for the fusion of the outcomes of multiple classifiers to
   recognize emotions. They also employed a personality trait of the specific speaker, obtained
   from the Eysenck personality questionnaire, and integrated it into the classifier. They
   obtained an 85.79% accurate result.

6. Chang-Wun Park et al. (2002) [18]
   Classifiers: RNN
   Proposed that pitch is an important feature and gave an idea of emotion recognition using an
   RNN.
Problem Statement
Scope of Work
Organization of Thesis:
Tools Used
Major tools: Jupyter Notebook (as IDE), Google Colab with a Tesla K80 GPU (for running our
models).
Major libraries: Keras, matplotlib, NumPy.
Proposed Methodology:
In our approach, we have converted the audio signals to their respective spectrograms and then
fed them as input to high-performing deep neural networks based on the Inception network,
ResNet, VGG16 and VGG19 in order to obtain a high-dimensional feature vector. These feature
vectors act as input to our dense neural network with two fully connected layers, followed by a
softmax output layer with 14 outputs, one for each emotion class in the TESS dataset.
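A minimal Keras sketch of this pipeline is given below. The backbone shown is VGG16; the
hidden-layer sizes (256 and 128 units) and the 224x224 input size are illustrative assumptions,
as the report does not state the exact values used.

```python
# Sketch: frozen pretrained backbone as feature extractor + two dense layers + 14-way softmax.
# Layer sizes and input shape are assumptions for illustration only.
from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense

backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
backbone.trainable = False  # transfer learning: keep the ImageNet weights fixed

model = Sequential([
    backbone,
    Flatten(),
    Dense(256, activation="relu"),    # first fully connected layer
    Dense(128, activation="relu"),    # second fully connected layer
    Dense(14, activation="softmax"),  # one output per TESS emotion class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The same dense head can be attached to any of the other backbones listed later, which yields the
per-architecture models compared in the Results section.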
Audio spectrograms are generated for each of the fourteen emotion classes:
● Older_angry, Older_disgust, Older_fear, Older_happy, Older_neutral, Older_sad, Older_surprise
● Young_angry, Young_disgust, Young_fear, Young_happy, Young_neutral, Young_sad, Young_surprise
Spectrogram:
Spectrograms are pictorial representations of an audio sample in which the X axis represents
time and the Y axis represents frequency. Intensity is represented by the various colors on the
spectrogram. When passed through our model, these spectrogram images help us identify important
features of the audio sample, such as shrillness, bass and pitch, which matter when we wish to
figure out the corresponding emotion in a person's voice. As the spectrogram represents the
whole audio sample, it is capable of capturing the different nuances of emotion throughout the
audio clip.
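As an illustration, one way such a spectrogram image could be rendered with matplotlib (listed
under Tools Used) is sketched below. The exact plotting parameters used in this project are not
stated in the report, and the file name is a placeholder for one TESS utterance.

```python
# Sketch: turning one audio clip into a spectrogram image file for the CNN.
# File name, figure size and FFT parameters are illustrative assumptions.
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("OAF_back_angry.wav", sr=None)   # waveform and sampling rate

plt.figure(figsize=(2.24, 2.24))                      # small square image
plt.specgram(y, Fs=sr, NFFT=512, noverlap=256)        # time on X axis, frequency on Y axis
plt.axis("off")                                       # keep only the colored intensity map
plt.savefig("OAF_back_angry.png", bbox_inches="tight", pad_inches=0)
plt.close()
```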
Pretrained, high-performance deep convolutional neural network architectures (which have been
used for image classification in computer vision and have achieved state-of-the-art results on
image datasets) are used for extracting high-dimensional features from the spectrogram images of
the audio samples.
Convolutional neural networks (CNNs) are a class of deep neural networks used to interpret
visual data and deduce important characteristics of images. A CNN comprises many layers that
learn from specific segments of an image and then combine these learnings to calculate the
output. These networks are computationally inexpensive to work with because weight sharing
leaves far fewer parameters to learn than deep fully connected layers, which scale poorly owing
to the explosive growth in the number of parameters. Dense layers can, however, be used at the
end to extract the relevant information from the audio sample once complex features have been
obtained through transfer learning on deep CNN architectures, as described above. This helps us
take advantage of the following state-of-the-art architectures for the purpose of emotion
recognition:
1. Inception-ResNet-V2
2. VGG16
3. VGG19
4. Inception-V3
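The abstract and introduction mention a maximal ensemble/voting over these architectures. The
report does not spell out the exact scheme; the following is a minimal sketch of one plausible
reading, a majority vote over each trained model's predicted class. Here models is assumed to be
a list of trained Keras models that share the same spectrogram input format.

```python
# Sketch: majority vote over the per-architecture predictions (assumed voting scheme).
import numpy as np

def ensemble_predict(models, x_batch):
    """Return the majority-vote class index for each spectrogram in x_batch."""
    # votes[m, i] = class predicted by model m for sample i
    votes = np.stack([m.predict(x_batch).argmax(axis=1) for m in models])
    # per-sample mode across the model axis (14 emotion classes)
    return np.array([np.bincount(votes[:, i], minlength=14).argmax()
                     for i in range(votes.shape[1])])
```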
Figures: the VGG19 architecture and the Inception-v3 architecture.
Dataset
We have used a standard dataset prepared by the University of Toronto, called the ‘Toronto
Emotional Speech Set (TESS)’. These stimuli were modeled on the Northwestern University Auditory
Test No. 6 (NU-6; Tillman & Carhart, 1966). A set of 200 target words were spoken in the carrier
phrase "Say the word _____" by two actresses (aged 26 and 64 years), and recordings were made of
the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise,
sadness, and neutral). There are 2800 stimuli in total. The two actresses were recruited from the
Toronto area. Both actresses speak English as their first language, are university educated, and
have musical training. Audiometric testing indicated that both actresses have thresholds within
the normal range.
Link for the Dataset : https://tspace.library.utoronto.ca/handle/1807/24487
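A minimal sketch of how the 2800 recordings might be grouped into the fourteen classes (two
speakers times seven emotions) is shown below. It assumes all .wav files have been gathered into
a single TESS/ folder and that each file name starts with the speaker code and ends with the
emotion label (e.g. OAF_back_angry.wav); verify these conventions against the downloaded corpus.

```python
# Sketch: grouping TESS utterances into the fourteen speaker-emotion classes.
# Directory layout and file-naming convention are assumptions; check the actual corpus.
import os
from collections import defaultdict

classes = defaultdict(list)
for fname in os.listdir("TESS"):
    if not fname.endswith(".wav"):
        continue
    parts = fname[:-4].split("_")                 # e.g. ["OAF", "back", "angry"]
    speaker, emotion = parts[0], parts[-1]
    label = ("Older_" if speaker == "OAF" else "Young_") + emotion
    classes[label].append(fname)

for label, files in sorted(classes.items()):
    print(label, len(files))                      # expect 200 utterances per class
```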
Results
We have used a variety of CNN models: VGG16, VGG19, InceptionV3 and Inception-ResNet-V2. Their
results were recorded as model accuracy curves, model loss curves and classification reports
(figures).
Comparison
Humans (2011): 82 %