Spectrograms
Ling He, Margaret Lech, Namunu Maddage
School of Electrical and Computer Engineering
RMIT University, Australia
Nicholas Allen
Department of Psychology
The University of Melbourne, Australia
ling.he@student.rmit.edu.au
nba@unimelb.edu.au
Abstract
We present new methods that extract characteristic
features from speech magnitude spectrograms. Two of
the presented approaches have been found particularly
efficient in the process of automatic stress and emotion
classification. In the first approach, the spectrograms
are sub-divided into ERB frequency bands and the
average energy for each band is calculated. In the
second approach, the spectrograms are passed through
a bank of 12 log-Gabor filters and the outputs are
averaged and passed through an optimal feature
selection procedure based on mutual information
criteria. The proposed methods were tested using single
vowels, words and sentences from the SUSAS database
with 3 classes of stress, and spontaneous speech
recordings made by psychologists (ORI) with 5
emotional classes. The classification results based on
the Gaussian mixture model show correct classification
rates of 40%-81% for the different SUSAS data sets and
40%-53.4% for the ORI database.
1. Introduction
1.1. Emotion and stress classification
Speech is the fundamental medium of human
communication, and it is not just the sounds and words
that are important; in all speech human emotion is
expressed and that emotion is a vital part of
communication. Just as effective human-to-human
communication is virtually impossible without speakers
being able to detect and understand each other's
emotions, human-machine communication suffers from
significant inefficiencies because machines cannot
understand our emotions or generate emotional
responses. Words are not enough to correctly understand
the mood and intention of a speaker and thus the
introduction of human social skills to human-machine
communication is of paramount importance. This can be
achieved by researching and creating methods of
speech modeling and analysis that embrace the signal,
linguistic and emotional aspects of communication.
Prosodic features of speech produced by a speaker
under stress or emotion differ from features
produced under the neutral condition.
[Figure 1: Block diagram of the classification system: pre-processing, voiced speech detection, feature generation and classification. The system was tested on the SUSAS data sets (including vowels and mixed words), the SUSAS Actual Speech under Stress data, and the ORI database.]
2.1.1. SUSAS Database
The Speech under Simulated and Actual Stress
(SUSAS) [13] database comprises a wide variety of
acted and actual stresses and emotions. Only speech
recorded under actual stress conditions was used in this
study. The speech samples were selected from the
Actual Speech under Stress domain, which includes
speech recordings made by the passengers during rides
on a roller-coaster. This domain consisted of recordings
from 7 speakers (4 male and 3 female). The speakers
were reading words from the 35 word list. The amount
of stress was subjectively determined by the position of
the roller-coaster during the time when the recording
was made. In total, 3179 speech recordings were used
in this study: 1202 representing high stress, 1276
representing moderate stress and 701 representing
neutral speech.
2.1.2. ORI Database
A soundtrack of video recordings from the Oregon
Research Institute (ORI) [14] was used to select speech
samples for processing. The data included 71 parents
(27 mothers and 44 fathers) video recorded while being
engaged in a family discussion with their children.
During the discussion the family was asked to discuss
different problem solving tasks. The videotapes were
annotated by a trained psychologist based on both
speech and facial expressions and using the Living in
Family Environments (LIFE) coding system [15]. The
Adobe Pro software was applied to convert the video
files into audio files with a sampling frequency of 8
kHz. Each class (angry, happy, anxious, dysphoric and
neutral) was represented by 200 utterances (100 with
male and 100 with female speech). The average duration
of each utterance was 3 seconds. A neutral voice tone
has an even, relaxed quality without marked stress on
individual syllables. Anger communicates
displeasure, irritation, annoyance or frustration. A
subject reflects happiness when the voice is high-pitched or has a sing-song tone that is not whining.
The average energy E_i for the i-th frequency band of the magnitude spectrogram s(x, y) was calculated as:

E_i = (1 / (N_f · N_t)) · Σ_{y=1}^{N_t} Σ_{x=1}^{N_f} s(x, y)    (1)

where N_f and N_t denote the number of frequency bins and time frames within the band, respectively.
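As an illustration, Eq. (1) amounts to a mean over each band of the spectrogram. A minimal sketch (the band edges below are hypothetical row-index ranges, not the ERB band edges used in the paper):

```python
import numpy as np

def band_average_energy(spec, band_edges):
    """Average energy E_i of a magnitude spectrogram within each frequency band.

    spec:       2-D array, rows = frequency bins, columns = time frames
    band_edges: list of (low_bin, high_bin) row-index ranges, one per band
    """
    energies = []
    for lo, hi in band_edges:
        band = spec[lo:hi, :]          # N_f x N_t block for band i
        energies.append(band.mean())   # (1/(N_f*N_t)) * sum over x, y
    return np.array(energies)
```

This yields one scalar feature per band, so a spectrogram is reduced to a feature vector whose length equals the number of bands.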
[Figure 2: Block diagrams of the feature extraction methods: (a) average energy of the spectrogram calculated over CB, Bark or ERB bands; (b) a single log-Gabor filter followed by optimal feature selection; (c) a bank of 12 log-Gabor filters with averaging and optimal feature selection; (d) patch extraction followed by 12 log-Gabor filters, averaging and optimal feature selection.]
2.3. Pre-processing
Both the SUSAS and ORI data sets were recorded in
real-life noisy conditions. To reduce the background
noise, a wavelet-based method developed by Donoho
[16] was applied. Speech signals of length N and
standard deviation σ were decomposed using the
wavelet transform with the mother wavelet db2 up to the
second level, and the universal threshold
λ = σ·sqrt(2·log(N)) was applied to each wavelet sub-band.
The signal was then reconstructed using the inverse
wavelet transform (IWT). The voiced speech was
extracted using a rule-based adaptive endpoint detection
method described in [20].
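The denoising step can be sketched as follows. For simplicity this illustration uses a Haar wavelet rather than db2 (so it demonstrates the universal-threshold scheme, not the exact filters used in the paper), with the noise level σ estimated from the finest detail coefficients:

```python
import numpy as np

def haar_dwt(x):
    # One level of the orthonormal Haar transform (length must be even).
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return a, d

def haar_idwt(a, d):
    # Inverse of one Haar level.
    x = np.empty(a.size * 2)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def denoise(signal, levels=2):
    """Wavelet denoising with Donoho's universal threshold lam = sigma*sqrt(2*log(N)).

    Signal length is assumed divisible by 2**levels.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    a, details = x, []
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(d)
    # Estimate sigma from the finest detail band (median absolute deviation).
    sigma = np.median(np.abs(details[0])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(n))
    details = [soft_threshold(d, lam) for d in details]
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a
```

A production version would use a db2 decomposition (e.g. via PyWavelets) in place of the hand-rolled Haar transform; the thresholding logic stays the same.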
G_radial(r) = exp( −[log(r / f_0)]² / (2·[log(σ_r / f_0)]²) )    (2)

G_angular(θ) = exp( −(θ − θ_0)² / (2·σ_θ²) )    (3)

G(r_m, θ_n) = G_radial(r) · G_angular(θ),  m = 1, …, N_r,  n = 1, …, N_θ    (4)
In Eqs. (2)-(4), (r, θ) are the polar coordinates, f_0
represents the central filter frequency, θ_0 is the
orientation angle, and σ_r and σ_θ represent the scale
bandwidth and the angular bandwidth, respectively. The
number of different wavelengths (scales) for the filter
bank was set to N_r = 2, and for each wavelength the
number of different orientations was set to
N_θ = 6. This produced a bank of 12 log-Gabor filters
{G_1, G_2, …, G_12}, with each filter representing a
different scale and orientation.
The log-Gabor feature representation |S(x,y)|m,n of a
magnitude spectrogram s(x,y) was calculated as a
convolution operation performed separately for the real
and imaginary part of the log-Gabor filters:
Re(S(x, y))_{m,n} = s(x, y) * Re(G(r_m, θ_n))    (5)

Im(S(x, y))_{m,n} = s(x, y) * Im(G(r_m, θ_n))    (6)

|S(x, y)|_{m,n} = sqrt( Re(S(x, y))_{m,n}² + Im(S(x, y))_{m,n}² )    (7)
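The filter bank and the convolution in Eq. (5) can be sketched in the frequency domain, where filtering by pointwise multiplication is equivalent to convolution. The centre frequency f0 and bandwidth ratio below are illustrative values, not the settings used in the paper:

```python
import numpy as np

def log_gabor_bank(shape, n_scales=2, n_orient=6, f0=0.1, sigma_ratio=0.65):
    """Bank of n_scales * n_orient log-Gabor filters, defined in the frequency domain."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)            # radial frequency
    theta = np.arctan2(fy, fx)                # orientation
    r[0, 0] = 1.0                             # avoid log(0); DC is zeroed below
    filters = []
    for m in range(n_scales):
        fm = f0 * (2.0 ** m)                  # centre frequency doubles per scale
        radial = np.exp(-np.log(r / fm) ** 2 / (2 * np.log(sigma_ratio) ** 2))
        radial[0, 0] = 0.0                    # log-Gabor has no DC component
        for n in range(n_orient):
            t0 = n * np.pi / n_orient         # orientation angle theta_0
            # Angular distance wrapped to (-pi, pi].
            dt = np.arctan2(np.sin(theta - t0), np.cos(theta - t0))
            angular = np.exp(-dt ** 2 / (2 * (np.pi / n_orient) ** 2))
            filters.append(radial * angular)
    return filters

def filter_responses(spec, filters):
    """Magnitude responses |S(x, y)|_{m,n} of a spectrogram under each filter."""
    F = np.fft.fft2(spec)
    return [np.abs(np.fft.ifft2(F * g)) for g in filters]
```

With the defaults (2 scales, 6 orientations) this produces the 12-filter bank described above; averaging each magnitude response then yields one feature per filter.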
2.8. Classification
The GMM method [1] is widely used in computational
pattern recognition. Each class is represented by a
Gaussian mixture and referred to as a class model
λ_n, n = 1, 2, 3, …, where n is a class index.
The complete class model λ_n is a weighted sum of M
component densities:

p(u | λ) = Σ_{i=1}^{M} p_i · b_i(u)    (10)

where u is a feature vector, b_i(u) are the component
densities and p_i are the mixture weights.
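A minimal sketch of this scheme, fitting one GMM per class and assigning a test sample to the class model with the highest log-likelihood (the function names and the use of scikit-learn are illustrative; the paper does not specify an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifiers(features_by_class, n_components=4, seed=0):
    """Fit one GMM per class; each fitted model plays the role of a class model lambda_n."""
    models = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)
        models[label] = gmm
    return models

def classify(models, feats):
    """Pick the class whose GMM gives the highest average log-likelihood of the features."""
    scores = {label: m.score(feats) for label, m in models.items()}
    return max(scores, key=scores.get)
```

Here `feats` is an array of feature vectors (e.g. the per-band energies or averaged log-Gabor outputs) extracted from one utterance, so classification pools the likelihoods of all vectors from that utterance.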
Correct classification rates (%) for the ERB band method were 81.82, 79.09, 70.69 and 70.63 for the SUSAS data sets, and 53.40 for the ORI database.
Correct classification rates (%) for single log-Gabor filters applied to the SUSAS vowels data, by scale m and orientation (n, deg):

m   (n, deg)   rates (%)
1   1, 00      54.55  50.42  52.13  53.10
1   2, 180     62.22  56.12  56.29  57.90
1   3, 360     57.58  58.78  60.20  58.36
1   4, 540     54.55  58.55  57.44  52.83
1   5, 720     54.75  56.24  59.80  58.64
1   6, 900     59.40  53.94  53.91  56.04
2   1, 00      50.50  51.40  48.30  47.51
2   2, 180     52.32  52.12  49.43  48.87
2   3, 360     58.38  52.85  58.88  58.18
2   4, 540     63.43  59.39  59.37  59.20
2   5, 720     58.99  58.42  57.82  58.93
2   6, 900     49.09  52.24  54.17  51.64
Correct classification rates (%) for the ORI database:

42.30  42.20  43.70
42.10  42.70  44.60
38.40  38.20  36.10
46.50  43.60  41.00
4. Conclusions
We have presented and tested a number of new
approaches to feature selection based on the analysis of
speech spectrograms. Two of these approaches, the ERB
bands method and the averaging of 12 log-Gabor filters,
showed promising results for automatic
stress and emotion classification in speech.
Our results showed significantly lower classification
rates for the ORI database than for the SUSAS
sets. This can be attributed to the different
environments in which the two databases
were recorded. The SUSAS database was generated for
the purpose of research on stress and emotion detection,
and contains speech recordings made during a
roller-coaster ride, when very strong stress or emotion
expression can be expected. The ORI data, on the other
hand, is a clinical database containing emotions
expressed spontaneously during typical family
conversations, when the emotional expressions are not
expected to be as strong as in the situations captured by
the SUSAS data.
In all approaches, the highest classification accuracy
was achieved when using single vowels, which is not
surprising since vowels are distinguished by
characteristic spectrogram patterns. The ORI data
was classified using voiced speech extracted from
utterances containing a number of words. It is
possible that the results for the ORI data could be
improved if, instead of voiced speech detection,
automatic detection of particular vowels were used,
with the features then extracted from spectrograms
representing these vowels.
5. References
[1] Quatieri T.F., Discrete-Time Speech Signal Processing,
Prentice Hall PTR, 2002.
[2] He L., Lech M., Maddage N., and Allen N., Emotion
Recognition in Speech of Parents of Depressed
Adolescents, iCBBE 2009.
[3] He L., Lech M., Maddage N., Memon S., and Allen N.,
Emotion Recognition in Spontaneous Speech within
Work and Family Environments, iCBBE 2009.
[4] He L., Lech M., Memon S., and Allen N., 2008,
Recognition of Stress in Speech Using Wavelet Analysis
and Teager Energy Operator, Interspeech 2008.
[5] Ezzat T., Poggio T., Discriminative Word-Spotting
Using Ordered Spectro-Temporal Patch Features, to
appear, SAPA workshop, Interspeech, Brisbane,
Australia, 2008.
[6] Bouvrie J., Ezzat T., Poggio T., Localized Spectro-Temporal
Cepstral Analysis of Speech, ICASSP, Las
Vegas, Nevada, 2008.