Spectrograms
Ling He, Margaret Lech, Namunu Maddage
School of Electrical and Computer Engineering
RMIT University, Australia
Nicholas Allen
Department of Psychology
The University of Melbourne, Australia
ling.he@student.rmit.edu.au
nba@unimelb.edu.au
Abstract
We present new methods that extract characteristic
features from speech magnitude spectrograms. Two of
the presented approaches have been found particularly
efficient in the process of automatic stress and emotion
classification. In the first approach, the spectrograms
are sub-divided into ERB frequency bands and the
average energy for each band is calculated. In the
second approach, the spectrograms are passed through
a bank of 12 log-Gabor filters and the outputs are
averaged and passed through an optimal feature
selection procedure based on mutual information
criteria. The proposed methods were tested using single
vowels, words and sentences from the SUSAS database
with 3 classes of stress, and spontaneous speech
recordings made by psychologists (ORI) with 5
emotional classes. The classification results based on
the Gaussian mixture model show correct classification
rates of 40%-81% for the different SUSAS data sets and
40%-53.4% for the ORI database.
1. Introduction
1.1. Emotion and stress classification
Speech is the fundamental medium of human
communication, and it is not just the sounds and words
that are important; in all speech human emotion is
expressed and that emotion is a vital part of
communication. Just as effective human-to-human
communication is virtually impossible without speakers
being able to detect and understand each other's
emotions, human-machine communication suffers from
significant inefficiencies because machines cannot
understand our emotions or generate emotional
responses. Words are not enough to correctly understand
the mood and intention of a speaker and thus the
introduction of human social skills to human-machine
communication is of paramount importance. This can be
achieved by researching and creating methods of
speech modeling and analysis that embrace the signal,
linguistic and emotional aspects of communication.
Prosodic features of speech produced by a speaker
under stress or emotion differ from features
produced under the neutral condition.
[Figure 1: Block diagram of the classification system: pre-processing, voiced speech detection, feature generation and classification. The system was tested on the SUSAS data sets (including vowels and mixed words), the SUSAS Actual Speech under Stress data, and the ORI database.]
2.1.1. SUSAS Database
The Speech under Simulated and Actual Stress
(SUSAS) [13] database comprises a wide variety of
acted and actual stresses and emotions. Only speech
recorded under actual stress conditions was used in this
study. The speech samples were selected from the
Actual Speech under Stress domain, which includes
speech recordings made by the passengers during rides
on a roller-coaster. This domain consisted of recordings
from 7 speakers (4 male and 3 female). The speakers
were reading words from the 35 word list. The amount
of stress was subjectively determined by the position of
the roller-coaster during the time when the recording
was made. In total, 3179 speech recordings were used
in this study: 1202 representing high stress, 1276
representing moderate stress and 701 representing
neutral speech.
2.1.2. ORI Database
A soundtrack of video recordings from the Oregon
Research Institute (ORI) [14] was used to select speech
samples for processing. The data included 71 parents
(27 mothers and 44 fathers) video recorded while being
engaged in a family discussion with their children.
During the discussion the family was asked to discuss
different problem solving tasks. The videotapes were
annotated by a trained psychologist based on both
speech and facial expressions and using the Living in
Family Environments (LIFE) coding system [15]. The
Adobe Pro software was applied to convert the video
files into audio files with a sampling frequency of 8
kHz. Each class (angry, happy, anxious, dysphoric and
neutral) was represented by 200 utterances (100 with
male and 100 with female speech). The average duration
of each utterance was 3 seconds. A neutral voice tone
has an even, relaxed quality without marked stress on
individual syllables. Anger communicates
displeasure, irritation, annoyance or frustration. A
subject reflects happiness when the voice is high-pitched or has a sing-song tone that is not whining.
The average energy E_i for the i-th frequency band of the magnitude spectrogram s(x, y) was calculated as:

E_i = (1 / (N_f · N_t)) · Σ_{y=1}^{N_t} Σ_{x=1}^{N_f} s(x, y)    (1)

where N_f and N_t denote the number of frequency bins and time frames within the band, respectively.
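As an illustration, Eq. (1) amounts to a mean over each band of the spectrogram. A minimal sketch (the band edges below are hypothetical row-index ranges, not the ERB band edges used in the paper):

```python
import numpy as np

def band_average_energy(spec, band_edges):
    """Average energy E_i of a magnitude spectrogram within each frequency band.

    spec:       2-D array, rows = frequency bins, columns = time frames
    band_edges: list of (low_bin, high_bin) row-index ranges, one per band
    """
    energies = []
    for lo, hi in band_edges:
        band = spec[lo:hi, :]          # N_f x N_t block for band i
        energies.append(band.mean())   # (1/(N_f*N_t)) * sum over x, y
    return np.array(energies)
```

This yields one scalar feature per band, so a spectrogram is reduced to a feature vector whose length equals the number of bands.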
[Figure 2: Block diagrams of the feature extraction methods: (a) average energy of the spectrogram calculated over CB, Bark or ERB bands; (b) a single log-Gabor filter followed by optimal feature selection; (c) a bank of 12 log-Gabor filters with averaging and optimal feature selection; (d) patch extraction followed by 12 log-Gabor filters, averaging and optimal feature selection.]
2.3. Pre-processing
Both the SUSAS and ORI data sets were recorded in
real-life noisy conditions. To reduce the background
noise, a wavelet-based method developed by Donoho
[16] was applied. Speech signals of length N and
standard deviation σ were decomposed using the
wavelet transform with the mother wavelet db2 up to the
second level, and the universal threshold
λ = σ·sqrt(2·log(N)) was applied to each wavelet sub-band.
The signal was then reconstructed using the inverse
wavelet transform (IWT). The voiced speech was
extracted using a rule-based adaptive endpoint detection
method described in [20].
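The denoising step can be sketched as follows. For simplicity this illustration uses a Haar wavelet rather than db2 (so it demonstrates the universal-threshold scheme, not the exact filters used in the paper), with the noise level σ estimated from the finest detail coefficients:

```python
import numpy as np

def haar_dwt(x):
    # One level of the orthonormal Haar transform (length must be even).
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return a, d

def haar_idwt(a, d):
    # Inverse of one Haar level.
    x = np.empty(a.size * 2)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def denoise(signal, levels=2):
    """Wavelet denoising with Donoho's universal threshold lam = sigma*sqrt(2*log(N)).

    Signal length is assumed divisible by 2**levels.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    a, details = x, []
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(d)
    # Estimate sigma from the finest detail band (median absolute deviation).
    sigma = np.median(np.abs(details[0])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(n))
    details = [soft_threshold(d, lam) for d in details]
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a
```

A production version would use a db2 decomposition (e.g. via PyWavelets) in place of the hand-rolled Haar transform; the thresholding logic stays the same.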
G_radial(r) = exp( −[log(r / f_0)]² / (2·[log(σ_r / f_0)]²) )    (2)

G_angular(θ) = exp( −(θ − θ_0)² / (2·σ_θ²) )    (3)

G(r_m, θ_n) = G_radial(r) · G_angular(θ),  m = 1, …, N_r,  n = 1, …, N_θ    (4)
In Eqs. (2)-(4), (r, θ) are the polar coordinates, f_0
represents the central filter frequency, θ_0 is the
orientation angle, and σ_r and σ_θ represent the scale
bandwidth and the angular bandwidth, respectively. The
number of different wavelengths (scales) for the filter
bank was set to N_r = 2, and for each wavelength the
number of different orientations was set to
N_θ = 6. This produced a bank of 12 log-Gabor filters
{G_1, G_2, …, G_12}, with each filter representing a
different scale and orientation.
The log-Gabor feature representation |S(x,y)|m,n of a
magnitude spectrogram s(x,y) was calculated as a
convolution operation performed separately for the real
and imaginary part of the log-Gabor filters:
Re(S(x, y))_{m,n} = s(x, y) * Re(G(r_m, θ_n))    (5)

Im(S(x, y))_{m,n} = s(x, y) * Im(G(r_m, θ_n))    (6)

|S(x, y)|_{m,n} = sqrt( Re(S(x, y))_{m,n}² + Im(S(x, y))_{m,n}² )    (7)
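The filter bank and the convolution in Eq. (5) can be sketched in the frequency domain, where filtering by pointwise multiplication is equivalent to convolution. The centre frequency f0 and bandwidth ratio below are illustrative values, not the settings used in the paper:

```python
import numpy as np

def log_gabor_bank(shape, n_scales=2, n_orient=6, f0=0.1, sigma_ratio=0.65):
    """Bank of n_scales * n_orient log-Gabor filters, defined in the frequency domain."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)            # radial frequency
    theta = np.arctan2(fy, fx)                # orientation
    r[0, 0] = 1.0                             # avoid log(0); DC is zeroed below
    filters = []
    for m in range(n_scales):
        fm = f0 * (2.0 ** m)                  # centre frequency doubles per scale
        radial = np.exp(-np.log(r / fm) ** 2 / (2 * np.log(sigma_ratio) ** 2))
        radial[0, 0] = 0.0                    # log-Gabor has no DC component
        for n in range(n_orient):
            t0 = n * np.pi / n_orient         # orientation angle theta_0
            # Angular distance wrapped to (-pi, pi].
            dt = np.arctan2(np.sin(theta - t0), np.cos(theta - t0))
            angular = np.exp(-dt ** 2 / (2 * (np.pi / n_orient) ** 2))
            filters.append(radial * angular)
    return filters

def filter_responses(spec, filters):
    """Magnitude responses |S(x, y)|_{m,n} of a spectrogram under each filter."""
    F = np.fft.fft2(spec)
    return [np.abs(np.fft.ifft2(F * g)) for g in filters]
```

With the defaults (2 scales, 6 orientations) this produces the 12-filter bank described above; averaging each magnitude response then yields one feature per filter.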
2.8. Classification
The GMM method [1] is widely used in computational
pattern recognition. Each class is represented by a
Gaussian mixture and referred to as a class model
λ_n, n = 1, 2, 3, …, where n is a class index.
The complete class model λ_n is a weighted sum of M
component densities:

p(u | λ) = Σ_{i=1}^{M} p_i · b_i(u)    (10)

where u is a feature vector, b_i(u) are the component
densities and p_i are the mixture weights.
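A minimal sketch of this scheme, fitting one GMM per class and assigning a test sample to the class model with the highest log-likelihood (the function names and the use of scikit-learn are illustrative; the paper does not specify an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifiers(features_by_class, n_components=4, seed=0):
    """Fit one GMM per class; each fitted model plays the role of a class model lambda_n."""
    models = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)
        models[label] = gmm
    return models

def classify(models, feats):
    """Pick the class whose GMM gives the highest average log-likelihood of the features."""
    scores = {label: m.score(feats) for label, m in models.items()}
    return max(scores, key=scores.get)
```

Here `feats` is an array of feature vectors (e.g. the per-band energies or averaged log-Gabor outputs) extracted from one utterance, so classification pools the likelihoods of all vectors from that utterance.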
Correct classification rates (%) for the ERB band method were 81.82, 79.09, 70.69 and 70.63 for the SUSAS data sets, and 53.40 for the ORI database.
Correct classification rates (%) for single log-Gabor filters applied to the SUSAS vowels data, by scale m and orientation (n, deg):

m   (n, deg)   rates (%)
1   1, 00      54.55  50.42  52.13  53.10
1   2, 180     62.22  56.12  56.29  57.90
1   3, 360     57.58  58.78  60.20  58.36
1   4, 540     54.55  58.55  57.44  52.83
1   5, 720     54.75  56.24  59.80  58.64
1   6, 900     59.40  53.94  53.91  56.04
2   1, 00      50.50  51.40  48.30  47.51
2   2, 180     52.32  52.12  49.43  48.87
2   3, 360     58.38  52.85  58.88  58.18
2   4, 540     63.43  59.39  59.37  59.20
2   5, 720     58.99  58.42  57.82  58.93
2   6, 900     49.09  52.24  54.17  51.64
Correct classification rates (%) for the ORI database:

42.30  42.20  43.70
42.10  42.70  44.60
38.40  38.20  36.10
46.50  43.60  41.00
4. Conclusions
We have presented and tested a number of new
approaches to feature selection based on the analysis of
speech spectrograms. Two of these approaches, the ERB
bands method and the averaging of 12 log-Gabor filters,
showed promising results for automatic
stress and emotion classification in speech.
Our results showed significantly lower classification
rates for the ORI database than for the SUSAS
sets. This can be attributed to the different
environments in which the two databases
were recorded. The SUSAS database was generated for
the purpose of research on stress and emotion detection,
and contains speech recordings made during a
roller-coaster ride, when very strong stress or emotion
expression can be expected. The ORI data, on the other
hand, is a clinical database containing emotions
expressed spontaneously during typical family
conversations, when the emotional expressions are not
expected to be as strong as in the situations captured by
the SUSAS data.
In all approaches, the highest classification accuracy
was achieved when using single vowels, which is not
surprising since vowels are distinguished by
characteristic spectrogram patterns. The ORI data
was classified using voiced speech extracted from
utterances containing a number of words. It is
possible that the results for the ORI data could be
improved if, instead of voiced speech detection,
automatic detection of particular vowels were used,
with the features then extracted from spectrograms
representing these vowels.
5. References
[1] Quatieri T.F., Discrete-Time Speech Signal Processing,
Prentice Hall PTR, 2002.
[2] He L., Lech M., Maddage N., and Allen N., Emotion
Recognition in Speech of Parents of Depressed
Adolescents, iCBBE 2009.
[3] He L., Lech M., Maddage N., Memon S., and Allen N.,
Emotion Recognition in Spontaneous Speech within
Work and Family Environments, iCBBE 2009.
[4] He L., Lech M., Memon S., and Allen N., 2008,
Recognition of Stress in Speech Using Wavelet Analysis
and Teager Energy Operator, Interspeech 2008.
[5] Ezzat T., Poggio T., Discriminative Word-Spotting
Using Ordered Spectro-Temporal Patch Features, to
appear, SAPA workshop, Interspeech, Brisbane,
Australia, 2008.
[6] Bouvrie J., Ezzat T., Poggio T., Localized Spectro-Temporal
Cepstral Analysis of Speech, ICASSP, Las
Vegas, Nevada, 2008.