
MINOR PROJECT REPORT

Submitted By:

Shivansh Agarwal
2K16/CO/303

Shivansh Gupta
2K16/CO/304

Tushar Gautam
2K16/CO/333
DECLARATION

We hereby declare that the Minor Project (B. Tech Project-I CO401) work entitled ​“Analysing
vocal patterns for Speech Emotion Recognition using Deep Learning” ​which is being
submitted to the Delhi Technological University, in partial fulfillment of requirements for the
award of degree of Bachelor of Technology in the Department of Computer Science &
Engineering, is a bonafide report of the Minor Project carried out by us. The material contained
in this report has not been submitted to any University or Institution for the award of any degree.

Submitted By:

Shivansh Agarwal
2K16/CO/303

Shivansh Gupta
2K16/CO/304

Tushar Gautam
2K16/CO/333

CERTIFICATE

This is to certify that the Minor Project Report entitled ​“Analysing vocal patterns for Speech

Emotion Recognition using Deep Learning” ​is the work of Shivansh Agarwal (2K16/CO/303),

Shivansh Gupta(2K16/CO/304) and Tushar Gautam (2K16/CO/333). This project was

completed under my supervision and form a part of Bachelor of Technology course curriculum

in the Department of Computer Science & Engineering, Delhi Technological University, Delhi.

​Date: ​( Dr. Aruna Bhat )

Associate Professor
Dept. of Computer Science & Engineering
Delhi Technological University

ACKNOWLEDGEMENT

First of all, we would like to express our deep sense of respect and gratitude to our project

supervisor ​Dr. Aruna Bhat for providing the opportunity of carrying out this project and being

the guiding force behind this work. We are deeply indebted to her for the support, advice and

encouragement she provided without which the project could not have been a success.

Secondly, we are grateful to ​Dr. Rajni Jindal​, HOD, Computer Science & Engineering

Department, DTU for her immense support. We would also like to acknowledge Delhi

Technological University library and staff for providing the right academic resources and

environment for this work to be carried out.

Last but not the least we would like to express sincere gratitude to our

parents and friends for constantly encouraging us during the completion of work.

Date:
Submitted By:

Shivansh Agarwal
2K16/CO/303

Shivansh Gupta
2K16/CO/304

Tushar Gautam
2K16/CO/333

B.Tech
Department of Computer
Science & Engineering
Delhi Technological University,
Delhi-110042

INDEX

S.NO  TITLE

1.  Abstract

2.  Introduction

3.  General Concept

4.  Motivation

5.  Related Work

6.  Problem Statement

7.  Tools Used

8.  Proposed Methodology

9.  Dataset

10. Results

Abstract

This paper presents a method for speech emotion recognition using spectrograms and deep
convolutional neural networks (CNNs) coupled with transfer learning. Spectrograms generated
from the audio signals are fed to the deep CNN. The model consists of convolutional layers
followed by fully connected layers, which extract discriminative features from the spectrogram
images and output predictions for fourteen categories of emotion. In this study, we trained the
proposed model on spectrograms obtained from the Toronto Emotional Speech Set (TESS).
Furthermore, we used a voting method to make the results more robust and reliable.

Analysing vocal patterns for Speech Emotion Recognition using Deep
Learning

Introduction

An Overview:

Human beings perceive emotions automatically and subconsciously. It is an essential part of any
human-to-human interaction, and emotions must likewise be considered if human-machine
interaction is to improve. As more and more speech-driven systems are developed, machines need
to become sensitive to human feelings in order to cater to human-specific processes and perform
as users expect. This gives rise to the need to classify human emotions into a set of important
classes that the system should care most about. Speech Emotion Recognition (SER) is an
important step towards making the interaction between machines and humans more natural. It
aims at identifying the psychological state of the speaker from the acoustic features of the
speech, irrespective of language and content. Many fields would benefit greatly if machines
could accurately recognize the emotional state of the speaker using acoustic features alone.
For example, doctors could use SER to better understand the mental state of a patient, call
center agencies could act according to the emotions of the customer, and teachers could adapt
their teaching methodology to a student's behavior.
There are, however, several challenges. Emotion recognition is an arduous task because human
emotional traits have a fuzzy temporal periphery. Recognizing the emotional state from speech
is difficult for the following reasons:
1. It is hard to specify where an emotion starts or ends.
2. A single speech sample may contain more than one emotion.
3. It is not clear which features best distinguish between emotions, because the acoustic
features of speech vary with speaking style, language, age, gender, and many other factors.
4. In many instances the speaker talks in a sarcastic tone and conveys the actual emotion
through facial expressions; in such cases it becomes difficult to estimate the emotion from
the speech signal alone.

In this paper, we present an approach in which we first convert the speech audio signals to
spectrograms and then use very deep convolutional neural networks, followed by fully connected
layers, to extract high-level features. We take a majority vote over the predictions of several
such architectures to perform the final classification into the different emotion classes.
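
As a rough illustration of the voting step, the sketch below takes a simple per-sample majority
vote over the class predictions of several models; the prediction arrays shown are hypothetical
placeholders, not results from this project.

```python
import numpy as np

def majority_vote(*model_predictions):
    """Combine per-model class predictions (each an array of class indices,
    one entry per test sample) by taking the most frequent label per sample."""
    stacked = np.stack(model_predictions, axis=0)   # shape: (n_models, n_samples)
    n_classes = stacked.max() + 1
    return np.apply_along_axis(
        lambda column: np.bincount(column, minlength=n_classes).argmax(),
        axis=0, arr=stacked)                        # shape: (n_samples,)

# Hypothetical predictions of three models for five test clips (class indices 0..13)
preds_vgg16     = np.array([0, 3, 7, 7, 12])
preds_vgg19     = np.array([0, 3, 7, 9, 12])
preds_inception = np.array([1, 3, 7, 7, 12])
print(majority_vote(preds_vgg16, preds_vgg19, preds_inception))  # -> [0 3 7 7 12]
```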

General Concept

What does Emotion recognition mean?

Emotion recognition is the process of identifying human emotion, most typically from facial
expressions as well as from verbal expressions. Humans do this automatically, but computational
methods have also been developed for it.

What is the difference between Sentiment Analysis and Emotional Analysis?

Although often treated as synonyms, "sentiment" and "emotion" do not express the same thing. The
dictionary defines a "sentiment" as an opinion or view, while an "emotion" refers to "a strong
feeling deriving from one's mood".

When we talk about sentiment and emotional analysis, we are therefore dealing with two distinct
methods of evaluating people's moods. Both aim to better understand the readers and give insights
into their emotional responses.

Sentiment Analysis:

Sentiment analysis aims to capture the general feel or impression people get from consuming a
piece of content. It does not focus on specific, articulated emotions.

It relies instead on a simplified binary system of "positive" and "negative" responses: we only
want to know whether the reader had a positive or negative experience with the content.
It is a simplified analysis method that yields insights that are easy to process and quantify.

This method has proven efficient at producing valuable insights about both audiences and content.
It helps businesses put a finger on the preferences and inclinations of readers: if a blog post
receives 60% negative feedback, you know that your audience is not particularly attracted to that
article.

Emotional Analysis

Contrary to sentiment analysis, emotional analysis relies on a more sophisticated and complex
system.

While the former uses a simplified binary categorization, the latter relies on a deeper analysis
of human emotions and sensitivities. This method highlights the nuances between the different
feelings readers express. It is a more meticulous, thorough look into the degrees and intensities
of each emotion.

Unlike sentiment analysis, emotional analysis is inclusive and considerate of the different
variations of human mental subjectivity. It is usually based on a wide spectrum of moods rather
than a couple of static categories. Within the positive range it detects specific emotions such as
happiness, satisfaction, or excitement, depending on how it is configured.

What are the various features that could be extracted from speech?

Various types of features have been proposed, such as mel-frequency cepstral coefficients
(MFCCs), prosodic features, linear predictive cepstral coefficients (LPCCs), and perceptual
linear prediction coefficients (PLPs), to achieve better results in human emotion recognition.
Recently, attention has also turned to alternative features extracted from auditory-inspired
long-term spectro-temporal representations, obtained by using a modulation filterbank together
with an auditory filterbank for speech decomposition. El Ayadi et al. grouped speech features
into four categories: continuous, qualitative, spectral, and TEO (Teager energy operator)-based
features. Although selecting these features can yield good performance, they have limitations
that have pushed researchers to build their own feature sets, since most standard features are
based on short-term analysis while the speech signal is non-stationary.
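
As an illustration of how the short-term features mentioned above are typically computed, the
sketch below extracts MFCCs and a pitch track from a hypothetical audio file. It uses librosa,
which is not part of this project's tool set, so treat it purely as an example.

```python
import librosa
import numpy as np

y, sr = librosa.load("speech_sample.wav", sr=None)   # hypothetical audio file

# 13 mel-frequency cepstral coefficients per frame (a spectral/cepstral feature)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

# Fundamental-frequency (pitch) track, a typical "continuous" feature
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"))

print(mfccs.shape, np.nanmean(f0))                   # mean pitch over voiced frames
```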

In our approach we use continuous speech features (pitch) for the classification of emotions,
by building a spectrogram for each audio sample.

Differences observed between spectrograms with respect to emotion:

1. Spectrogram for sad emotion

2. Spectrogram for happy emotion

The difference between the two emotions is clearly visible in the spectrograms: the pitch for the
happy emotion is higher than the pitch for the sad emotion.

Differences observed between spectrograms with respect to the age of the speaker:

1. Spectrogram for sad emotion of an old speaker

2. Spectrogram for sad emotion of a young speaker

The difference between the speakers' ages is clearly visible in the spectrograms: the pitch for
the young speaker is higher than the pitch for the old speaker.

Motivation

This research work is specifically motivated by the following factors:

● Human-machine interfaces are commonly used nowadays in many applications. Most of them
would benefit from detecting emotion in speech, but very few of the interfaces currently in
use are able to do so.

● Emotion recognition can greatly enhance applications such as psychotherapy.

● Emotion recognition can play a vital role in building state-of-the-art artificially
intelligent systems.

● It can also greatly enhance natural language processing.

● With the help of speech emotion recognition, machines will be able to hold more human-like
conversations.

Related Work

Different methodologies given for speech emotion recognition

LITERATURE SURVEY ON CLASSIFIERS USED IN EMOTION RECOGNITION
USING SPEECH

1. Yuanlu Kuang and Lijuan Li et al. (2013)[12]
   Classifiers: HMM, ANN
   Description: Propose a Dempster–Shafer evidence theory based decision fusion of the two
   classifiers to recognize emotions such as anger, sadness, surprise and disgust. The system
   achieved an 83.86% recognition rate.

2. Bjorn Schuller et al. (2004)[20]
   Classifiers: GMM
   Description: The system achieved an 86% recognition rate, where human judges achieved 79.8%
   on the same corpus.

3. Wen-Yi Huang and Tsang-Long Pao (2012)[9]
   Classifiers: KNN, HMM, GMM, SVM
   Description: Propose including keywords as a feature rather than only the speech signal; the
   final result is obtained by a fusion technique.

4. Amiya Kumar et al. (2015)[19]
   Classifiers: SVM
   Description: A multilevel SVM is used to identify seven emotions, with an observed recognition
   rate of 82.26%.

5. Chung-Hsien Wu et al. (2011)[23]
   Classifiers: HMM, SVM, MLP
   Description: Propose a meta decision tree (MDT) for fusing the outcomes of multiple
   classifiers. They also obtain a personality trait of the specific speaker from the Eysenck
   personality questionnaire and integrate it into the classifier. The system achieved 85.79%
   accuracy.

6. Chang-Wun Park et al. (2002)[18]
   Classifiers: RNN
   Description: Proposed pitch as an important feature and gave an idea of emotion recognition
   using an RNN.

Problem Statement

This research is aimed at implementing a Speech Emotion Recognition system by analysing vocal
patterns using deep learning.

Scope of Work

The approach we have presented can be improved in the following ways:

1. Collecting more data samples for every class of emotion and for each age group.
2. The dataset we used contains the voices of two female actresses (aged 26 and 64 years),
recorded while portraying each of seven emotions (anger, disgust, fear, happiness, pleasant
surprise, sadness, and neutral). Both actresses were recruited from the Toronto area, speak
English as their first language, are university educated, and have musical training;
audiometric testing indicated that both have thresholds within the normal range.
The method we proposed is therefore accent dependent: people from different geographical
regions may have different pitch, so their utterances might not be classified as the same
emotion even when they actually express it. Future work could aim at making the approach
accent independent.
3. The approach is also gender dependent, because males usually have a lower pitch than
females. Future work could aim at making the approach gender independent.


Tools Used

Major tools: Jupyter Notebook (as the IDE) and Google Colab with a Tesla K80 GPU (for training
and running our models).
Major libraries: Keras, matplotlib, NumPy.

Proposed Methodology:

In our approach, we convert the audio signals to their respective spectrograms and feed them as
input to several high-performing deep neural networks based on the Inception network, ResNet,
VGG16 and VGG19 in order to obtain a high-dimensional feature vector. These feature vectors act
as the input to a dense neural network with two fully connected layers, followed by a softmax
output layer with 14 outputs, one for each emotion class in the TESS dataset.
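
The report does not specify the sizes of the two fully connected layers, so the Keras sketch
below assumes 512 and 256 hidden units; it only illustrates the dense classification head that
maps a precomputed feature vector to the 14 emotion classes.

```python
from tensorflow.keras import layers, models

def build_classifier_head(feature_dim, num_classes=14):
    """Dense head that maps a feature vector extracted by a pretrained CNN
    to the 14 TESS age/emotion classes (hidden sizes are illustrative)."""
    model = models.Sequential([
        layers.Input(shape=(feature_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

head = build_classifier_head(feature_dim=2048)   # e.g. pooled InceptionV3 features
head.summary()
```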

Audio Spectrogram

The 14 emotions into which the audio clips are classified are:

● Older_angry        ● Young_angry
● Older_disgust      ● Young_disgust
● Older_fear         ● Young_fear
● Older_happy        ● Young_happy
● Older_neutral      ● Young_neutral
● Older_sad          ● Young_sad
● Older_surprise     ● Young_surprise

Spectrogram:

Spectrograms are pictorial representations of an audio sample in which the X axis represents
time and the Y axis represents frequency; intensity is represented by color. When passed through
our model, these spectrogram images help identify important characteristics of the audio sample,
such as shrillness, bass and pitch, which matter when we want to determine the emotion in a
person's voice. Because the spectrogram represents the whole audio sample, it is able to capture
the different nuances of emotion throughout the clip.
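
A minimal sketch of this step is shown below. It reads a WAV clip with scipy.io.wavfile and
renders its spectrogram with matplotlib's specgram, saving the plot as an image that can later be
fed to the CNNs; scipy is not listed among the project's tools, so it is an assumption here, as is
the example file name.

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

def save_spectrogram(wav_path, png_path):
    sr, samples = wavfile.read(wav_path)      # sample rate and raw samples
    if samples.ndim > 1:                      # keep a single channel if stereo
        samples = samples[:, 0]
    plt.figure(figsize=(3, 3))
    plt.specgram(samples, Fs=sr, NFFT=512, noverlap=256, cmap="viridis")
    plt.axis("off")                           # the CNN only needs the image itself
    plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close()

save_spectrogram("OAF_back_sad.wav", "OAF_back_sad.png")   # hypothetical file names
```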

High Dimensional Feature Extractor:

Pretrained high-performance deep convolutional neural network architectures, which have been
used for image classification in computer vision and have achieved state-of-the-art results on
image datasets, are used to extract high-dimensional features from the spectrogram images of the
audio samples.

Convolutional neural networks (CNNs) are a class of deep neural networks used to interpret
visual data and deduce important characteristics of images. A CNN comprises many layers that
learn from specific segments of an image and then combine these learned representations to
compute the output. Such networks are computationally cheap to work with, because weight sharing
leaves far fewer parameters to learn than deep fully connected layers, which scale poorly due to
the rapid growth in the number of parameters. Dense layers can, however, be used at the end to
extract the relevant information once complex features have been obtained via transfer learning
on the deep CNN architectures. This lets us take advantage of these state-of-the-art
architectures for the purpose of emotion recognition.

The DCNNs we are going to use here are:

1. Inception-Resnet-V2
2. VGG16
3. VGG19
4. Inception-V3
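
As a rough sketch of how such a pretrained base can be used as a feature extractor, the code
below loads VGG16 from keras.applications with its classification top removed and computes one
pooled feature vector per spectrogram image; the 224x224 input size and the file name are
assumptions, and the other three bases listed above can be swapped in the same way (each with its
own preprocess_input and input size).

```python
import numpy as np
from tensorflow.keras.applications import VGG16   # VGG19, InceptionV3, InceptionResNetV2 work alike
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained ImageNet weights, classification top removed; global average pooling
# turns the final feature maps into a single feature vector per image.
base = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    """Return the pooled feature vector for one spectrogram image."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)[np.newaxis, ...]   # shape: (1, 224, 224, 3)
    x = preprocess_input(x)
    return base.predict(x)[0]                      # shape: (512,) for VGG16

features = extract_features("OAF_back_sad.png")    # hypothetical spectrogram image
```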

The VGG 19 Architecture

Inception-v3 Architecture

Dataset
We have used a standard dataset prepared by the University of Toronto, the Toronto Emotional
Speech Set (TESS). The stimuli were modeled on the Northwestern University Auditory Test No. 6
(NU-6; Tillman & Carhart, 1966). A set of 200 target words were spoken in the carrier phrase
"Say the word _____" by two actresses (aged 26 and 64 years), and recordings were made of the set
portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness,
and neutral), giving 2800 stimuli in total. The two actresses were recruited from the Toronto
area. Both speak English as their first language, are university educated, and have musical
training. Audiometric testing indicated that both actresses have thresholds within the normal
range.
Link for the dataset: https://tspace.library.utoronto.ca/handle/1807/24487
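
For reference, the sketch below derives the 14 age/emotion class labels used in this project from
TESS file names, assuming the distribution's <speaker>_<word>_<emotion>.wav naming convention
(OAF for the older actress, YAF for the younger one); adjust it if your copy of the dataset is
organized differently.

```python
import os

SPEAKER_TO_AGE = {"OAF": "Older", "YAF": "Young"}   # older / younger actress
EMOTION_ALIASES = {"ps": "surprise"}                # "pleasant surprise" is abbreviated in file names

def label_from_filename(filename):
    """Map a TESS file name such as 'OAF_back_angry.wav' to one of the 14 classes."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    speaker, _word, emotion = stem.split("_", 2)
    emotion = EMOTION_ALIASES.get(emotion.lower(), emotion.lower())
    return f"{SPEAKER_TO_AGE[speaker]}_{emotion}"

print(label_from_filename("OAF_back_angry.wav"))    # -> Older_angry
print(label_from_filename("YAF_dog_sad.wav"))       # -> Young_sad
```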

Results

We have used a variety of CNN models and their results are as follows:

VGG16

Model Accuracy: 96.07 %

Report:

Model Accuracy Curve:

Model Loss Curve:

VGG19

Model Accuracy: 97.86 %

Report:

Model Accuracy Curve:

Model Loss Curve:

InceptionV3

Model Accuracy: 94.64 %

Report:

Model Accuracy Curve:

Model Loss Curve:

Inception Resnet V2

Model Accuracy: 99.29 %

Report:

Model Accuracy Curve:

Model Loss Curve:

Comparison

S. No. METHOD YEAR ACCURACY

1. VGG16 2019 96.07 %

2. VGG19 2019 97.86 %

3. InceptionV3 2019 94.64 %

4. Inception-Resnet-V2 2019 99.29 %

5. Cross Correlation and Acoustic Features 2018 91.3 %
6. SVM 2016 96 %

7. Humans 2011 82 %

