
EUROCON 2005 Serbia & Montenegro, Belgrade, November 22-24, 2005

Time-Frequency Based Approach for Analysis and Synthesis of Emotional Voice

Mamoru Kobayashi and Shigeo Wada, Member, IEEE

Abstract - In this paper, analysis and synthesis methods for emotional voice, intended for a natural man-machine interface, are developed. First, emotional voice (neutral, anger, sadness, joy, dislike) is analyzed using a time-frequency representation of speech and similarity analysis. Then, based on the result of the emotional analysis, a voice with neutral emotion is transformed to synthesize a particular emotional voice using time-frequency modifications. In the simulations, five types of emotion are analyzed using 50 samples of speech signals, and a satisfactory average discrimination rate is achieved in the similarity analysis. Further, the synthesized emotional voice is evaluated subjectively. It is confirmed that emotional voice is generated naturally by the proposed time-frequency based approach.

Keywords - Emotional voice, Speech processing, Time-frequency analysis

I. INTRODUCTION

Man-machine interaction using voice is an attractive tool compared with the traditional mouse-based interface. Since the emotion accompanying a voice is an important aspect of human communication, a natural interaction can be realized if emotional voice can be recognized and generated by a computer at the interface. Furthermore, emotional analysis is valuable for improving the accuracy of speech recognition and synthesis.

A number of algorithms and methods for the analysis and synthesis of emotional voice have been developed [1]-[6]. It is essentially difficult to analyze emotional voice completely, because the state of emotion is complicated and related to a number of human factors such as the content of speech, gender, age, mentality and personality.

Recently, emotional voice has been synthesized and analyzed by signal processing approaches. In [1], the generation of emotional voice is attempted using pitch, short-time energy and voice speed, which are calculated by the PARCOR method. Emotional voice synthesis based on the speaker's intention [2] and behavior [3] has also been investigated. For emotional analysis, a method based on an HMM model is proposed in [4]. Statistical approaches such as principal component analysis and discriminant analysis are also proposed in [5] to construct an emotional model in a database. In [6], a statistical method with a higher discrimination rate using both pitch and phoneme is proposed. These approaches are useful for the analysis and synthesis of emotional voice under certain conditions. However, the methodologies are not universal enough to be effective in all cases, and a reliable method for emotional voice analysis and synthesis remains to be established.

In this paper, emotional voice analysis and synthesis methods based on a time-frequency approach are investigated. To obtain a precise time-frequency representation of emotional voice, the combined multiple short-time Fourier transform (MSTFT) presented in [7] is utilized. The practical MSTFT based on a minimum mean square criterion achieves a higher-resolution time-frequency analysis than the conventional STFT and the wavelet transform [7], [8]. Because the details of frequency variation can be expressed with fine resolution, it is well suited to emotional voice analysis. Using the extracted emotional features, such as pitch information (frequency, amplitude) of five types of emotions, a similarity analysis is executed. Next, based on the analysis results, three types of emotional voice are experimentally synthesized by transforming the time-frequency representation of a neutral voice.

In Section II, the analysis method for emotional voice is presented. A synthesis method for emotional voice based on the time-frequency approach is explained in Section III. Simulation results of analysis and synthesis using five types of basic emotional voice (neutral, anger, sadness, joy, dislike) are shown in Section IV. Section V concludes the paper.

M. Kobayashi and S. Wada are with the Graduate School of Engineering, Tokyo Denki University, 2-2 Kanda-Nishiki-cho, Chiyoda-ku, Tokyo 101-8457, Japan (phone: +81-3-5280-3314; fax: +81-3-5280-3573; e-mail: wada@cck.dendai.ac.jp).

II. TIME-FREQUENCY ANALYSIS OF SPEECH SIGNALS

A. Time-Frequency Representation

A definition of the combined MSTFT given in [7] is introduced first. Let P prototype windows h^{(i)}[k], k \in \mathbb{Z}, i = 0, 1, \ldots, P-1, be given. Shifting the i-th prototype window by nN_i in time and by 2\pi m / M_i in frequency, the following set of windows is obtained:

    h_{m,n}^{(i)}[k] = h^{(i)}[k - nN_i] e^{j 2\pi m k / M_i}                      (1)

where m = 0, 1, \ldots, M_i - 1, n \in \mathbb{Z}, and N_i, M_i \in \mathbb{Z} are the decimation factor and the number of frequency bands, respectively [8]. All prototype windows are assumed to have finite length, i.e. h^{(i)}[k] = 0 for k \notin [0, Q_i - 1].
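To make Eq. (1) concrete, the following Python sketch builds members of the window family derived from a single prototype window. The Kaiser prototype and the values of Q_i, N_i and M_i are illustrative assumptions only, not parameters reported in the paper.

```python
import numpy as np
from scipy.signal.windows import kaiser

# Illustrative parameters (not taken from the paper)
Q = 256                      # prototype window length Q_i
N = 64                       # time decimation factor N_i
M = 256                      # number of frequency bands M_i
h = kaiser(Q, beta=8.0)      # one prototype window h^(i)[k]

def window_mn(h, m, n, N, M, length):
    """h_{m,n}^{(i)}[k] = h^{(i)}[k - nN] * exp(j 2 pi m k / M) on 0..length-1."""
    k = np.arange(length)
    shifted = np.zeros(length)
    start = n * N
    if 0 <= start < length:                      # keep only windows inside the signal span
        seg = min(len(h), length - start)
        shifted[start:start + seg] = h[:seg]
    return shifted * np.exp(1j * 2 * np.pi * m * k / M)

# A few members of the family: time index n, frequency index m
w_00 = window_mn(h, m=0, n=0, N=N, M=M, length=4096)
w_31 = window_mn(h, m=3, n=1, N=N, M=M, length=4096)
```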



Then the MSTFT D_{m,n}^{(i)}(x) of a signal x[k], associated with the set of windows h_{m,n}^{(i)}[k] generated from the P prototype windows, is defined by

    D_{m,n}^{(i)}(x) = \langle x[k], h_{m,n}^{(i)}[k] \rangle                      (2)

The squared magnitude of the MSTFT can serve as a time-frequency distribution representing the behavior of a signal. The shape of the window determines the resolution of the MSTFT.

In order to obtain higher resolution with respect to both time and frequency over the time-frequency plane, the MSTFTs are combined. The method is a prescribed minimum mean square error approach, in which the following minimization is executed:

    S_{m,n} = \arg\min_{\phi_{m,n}} \frac{1}{P} \sum_{i=0}^{P-1} \sum_{m,n} \left| \phi_{m,n} - \left| D_{m,n}^{(i)} \right|^2 \right|^2        (3)

The resulting distribution is obtained by

    S_{m,n} = \frac{1}{P} \sum_{i=0}^{P-1} \left| D_{m,n}^{(i)} \right|^2          (4)
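A minimal sketch of the combination in Eqs. (2)-(4), assuming Kaiser prototype windows of different lengths evaluated on a common hop and FFT grid so that their squared-magnitude STFTs can be averaged; the specific lengths, hop size and FFT length are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.signal.windows import kaiser

def stft_mag2(x, win, hop, nfft):
    """Squared-magnitude STFT |D^(i)_{m,n}|^2 on a fixed (hop, nfft) grid."""
    pad = np.concatenate([x, np.zeros(nfft)])        # simple edge handling
    starts = range(0, len(x), hop)
    spec = np.empty((nfft // 2 + 1, len(starts)))
    for j, s in enumerate(starts):
        seg = np.zeros(nfft)
        seg[:len(win)] = pad[s:s + len(win)] * win   # windowed, zero-padded frame
        spec[:, j] = np.abs(np.fft.rfft(seg)) ** 2
    return spec

def combined_mstft(x, hop=64, nfft=512, lengths=(128, 256, 512), beta=8.0):
    """Eq. (4): average the P squared-magnitude STFTs, which is the
    minimum mean-square solution of the combination in Eq. (3)."""
    prototypes = [kaiser(L, beta) for L in lengths]  # P prototype windows
    dists = [stft_mag2(x, w, hop, nfft) for w in prototypes]
    return sum(dists) / len(dists)                   # S_{m,n}
```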
Figure 1 shows an example of the time-frequency representation S_{m,n}. Kaiser windows with different resolutions are used as the prototype windows. As shown in [1], the pitch information is sufficient to represent emotion. The sample voices uttered with a particular emotion are analyzed by the MSTFT, and the emotional features of the pitch are extracted from the time-frequency representation and applied to the similarity analysis. Figure 2 shows examples of the pitch variation of the neutral voice shown in Fig. 1: Fig. 2 (a) represents the variation of the pitch frequency, and Fig. 2 (b) the variation of the pitch amplitude.

Fig. 1. Example of time-frequency representation of a neutral emotional voice.

Fig. 2. Pitch variation of the neutral emotional voice shown in Fig. 1. (a) Frequency variation, (b) Amplitude variation.
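The paper does not specify how the pitch frequency and amplitude tracks of Fig. 2 are obtained from the time-frequency representation; the sketch below assumes a simple per-frame peak search in a fixed pitch band, which is only one possible implementation, and forms the feature vector (minimum, maximum, average) used later in Section IV.

```python
import numpy as np

def pitch_track(S, fs, nfft, fmin=60.0, fmax=400.0):
    """Per-frame pitch frequency and amplitude from a TF distribution S
    (bins x frames) by peak picking in a fixed band. The band limits and
    the peak-picking rule are assumptions, not the paper's tracker."""
    freqs = np.arange(S.shape[0]) * fs / nfft
    band = (freqs >= fmin) & (freqs <= fmax)
    idx = np.argmax(S[band, :], axis=0)
    pitch_freq = freqs[band][idx]          # frequency of the dominant bin per frame
    pitch_amp = S[band, :].max(axis=0)     # its amplitude per frame
    return pitch_freq, pitch_amp

def pitch_features(track):
    """Minimum, maximum and average of a pitch variation track."""
    return np.array([track.min(), track.max(), track.mean()])
```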
Here, the voice emotions used in our research are clarified. The neutral emotion is a voice state in which the mind is quiet and calm. Anger is a voice emotion with a loud and rapid voice. Sadness is a voice emotion with a small and drooping voice. Joy is a voice emotion with a cheerful and encouraged voice. Dislike is a voice emotion expressing a state of refusal.

B. Similarity Analysis

Now, the method of similarity analysis is described. The pitch variation information with respect to frequency and amplitude is used as the feature parameters. The distance between two feature parameters s_i and s_j is given by

    r_{ij} = \| s_i - s_j \|                                                       (5)

Here, Eq. (5) satisfies r_{ii} = 0 and r_{ij} = r_{ji}. Then, the similarity between two parameters is defined by

    \lambda_{ij} = 1 - \frac{r_{ij}}{\max_{i,j} r_{ij}}                            (6)

When two parameters are similar, the similarity \lambda_{ij} given by Eq. (6) approaches one. The matrix whose elements are given by Eq. (6) is called the similarity matrix. By analyzing the similarity matrix with a threshold value, the clusters are generated.
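The similarity analysis of Eqs. (5) and (6) and the threshold-based clustering can be sketched as follows; reading the clustering step as connected components of the thresholded similarity graph is an assumption, while the threshold value 0.9 follows Section IV.

```python
import numpy as np

def similarity_matrix(features):
    """Eqs. (5)-(6): pairwise distances r_ij = ||s_i - s_j|| and
    similarities lambda_ij = 1 - r_ij / max r_ij."""
    F = np.asarray(features, dtype=float)              # one feature vector per sample
    r = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    return 1.0 - r / max(r.max(), 1e-12)               # guard against identical samples

def threshold_clusters(lam, thresh=0.9):
    """Group samples whose mutual similarity exceeds the threshold
    (a connected-components reading of the clustering step)."""
    n = lam.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], current
        while stack:                                   # flood fill over the similarity graph
            j = stack.pop()
            for k in np.where(lam[j] >= thresh)[0]:
                if labels[k] < 0:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels
```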

III. TIME-FREQUENCY BASED APPROACH FOR SYNTHESIS OF EMOTIONAL VOICE

In this section, a synthesis method for emotional voice by time-frequency modification is presented. For the emotional synthesis, each feature is analyzed as a difference with respect to the neutral voice. Based on the similarity analysis results, the emotional features can be represented by qualitative vectors. To synthesize a voice carrying a particular emotion from the neutral voice, the time-frequency representation is modified based on the emotional analysis.



The qualitative features of the emotions obtained in the analysis are summarized as follows.

<Anger>
- The fluctuation of the pitch frequency is large.
- The average pitch frequency is slightly high.
- The pitch amplitude is large.
- The duration of the sentence is short.

<Sadness>
- The fluctuation of the pitch frequency is slow.
- The average pitch frequency is low.
- The pitch amplitude is small.
- The duration of the sentence is long.

<Joy>
- The fluctuation of the pitch frequency is large.
- The average pitch frequency is high.
- The pitch frequency rises at the end of the sentence.
- The pitch amplitude is large.
- The duration of the sentence is long.

The time-frequency modification is executed by minimizing the following measure:

    e = \sum_{m,n} \left| C_{m,n} - \alpha(n) \tilde{C}_{m-\beta(n),n} \right|^2   (8)

Here, C_{m,n} and \tilde{C}_{m,n} represent the time-frequency representations of the pitch for a representative voice of the identified emotional class and for a neutral voice, respectively. The parameters \alpha(n) and \beta(n) represent the amounts of amplitude and frequency transformation. In order to obtain the voice signal, the time-frequency representation modified with these parameters is transformed back to the time domain by the inverse STFT (ISTFT). Figure 3 shows the flow of the emotional voice synthesis. It is noted that the ISTFT is represented by

    x[k] = \sum_{m=0}^{M-1} \sum_{n=-\infty}^{\infty} D_{m,n}[x] h[k - nN] e^{j 2\pi m k / M}

[Fig. 3 block diagram: neutral speech → STFT → time-frequency representation → modification of pitch frequency and pitch amplitude based on the emotional analysis → ISTFT → emotional speech]

Fig. 3. Emotional voice synthesis by time-frequency modification.

IV. SIMULATION RESULTS

The sample voices used in the simulations are uttered by a man. To evaluate the discrimination by the similarity analysis, a discrimination rate is introduced. The discrimination rate DR of the emotional voices is defined by the following equation:

    DR = \frac{S}{T + E} \times 100 \ (\%)

Here, T represents the total number of sample voices in a certain emotional group, S represents the number of correctly clustered sample voices, and E represents the number of incorrectly clustered sample voices.

Tables 1, 2 and 3 show the discrimination rates for the emotional analysis. The threshold value is set to 0.9. Table 1 shows the discrimination rates of the five types of emotional voice signals using pitch frequency features. The voice sample 'a/me/da' ("It's rain") is used; 50 sample voices are prepared, with 10 samples for each emotion. The feature vector used for the analysis consists of the minimum, maximum and average of the pitch variation. In this case, the discrimination rates of the dislike and neutral emotions by the maximum and average features are high. Moreover, the discrimination rate of the dislike and sadness emotions by the minimum feature is also useful.

Table 2 shows the discrimination rates of the same emotional sample voices using pitch amplitude features. In this case, the discrimination of the joy emotion by the maximum and average features is high.

Finally, the minimum and average of the pitch frequency variation and the average of the pitch amplitude are selected as feature parameters and combined in the similarity analysis. Table 3 shows the discrimination rates of the five types of emotional voices using the similarity matrix. In this case, a satisfactory average discrimination rate of 81.02% is achieved with a small number of parameters.

Table 1: Discrimination rates of five types of emotional voices using pitch frequency of an individual sentence.

         Neutral   Anger   Sadness   Joy    Dislike
  max    90.9      18.6    51.0      54.2   90.9
  min    54.3      22.0    78.0      28.7   89.3
  ave.   100       30.1    29.7      50     100

Table 2: Discrimination rates of five types of emotional voices using pitch amplitude of an individual sentence.

         Neutral   Anger   Sadness   Joy    Dislike
  max    35.0      21.8    24.7      88.2   35.0
  min    31.6      22.4    38.5      32.9   30.3
  ave.   33.3      43.1    55.6      87.0   33.3

Table 3: Overall discrimination rates of five types of emotional voices using the similarity matrix.

         Neutral   Anger   Sadness   Joy    Dislike
  Rate   84.7      71.2    78.0      86.5   84.7

Next, based on the emotional analysis results, the three types of emotional voice (anger, sadness, joy) are synthesized by transforming the neutral voice. Figure 4 shows an example of the synthesis of an emotional voice (anger). Figure 4 (a) shows the time-frequency representation of the neutral voice, Figure 4 (b) shows the time-frequency representation of the anger voice,



and Figure 4 (c) shows the time-frequency representation of the synthesized voice.

The synthesized emotional voices are evaluated by subjective listening. It is confirmed that the synthesized voices sound natural and convey the intended emotions. Further, the errors of the pitch variation are compared numerically. Table 4 shows the average errors of the pitch (frequency and amplitude) between the representative and synthesized voices, and Figure 5 shows the amplitude error of the synthesized emotional voice. The error is sufficiently small for the synthesized voice to carry the desired emotion, as confirmed by the subjective evaluations.

Fig. 4. Example of time-frequency representation of synthesized emotional voice (anger). (a) Neutral voice, (b) Emotional voice, (c) Synthesized emotional voice.

Fig. 5. Amplitude error of synthesized emotional voice.


Table 4: Average errors of pitch (frequency and amplitude)
between representative and synthesized emotional voices.
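As a rough sketch of the modification-and-inversion step of the synthesis flow in Fig. 3, the code below applies given per-frame amplitude factors alpha(n) and frequency shifts beta(n) to the STFT of a neutral voice and inverts the result. It does not reproduce the estimation of alpha(n) and beta(n) by minimizing Eq. (8); the STFT parameters and the use of integer bin shifts are simplifying assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def synthesize_emotional(x, fs, alpha, beta, nperseg=512, hop=128):
    """Apply per-frame amplitude scaling alpha[n] and integer frequency-bin
    shifts beta[n] to the STFT of a neutral voice, then invert with the ISTFT."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    Zmod = np.zeros_like(Z)
    for n in range(Z.shape[1]):
        a = alpha[min(n, len(alpha) - 1)]
        b = int(beta[min(n, len(beta) - 1)])
        Zmod[:, n] = a * np.roll(Z[:, n], b)   # crude circular shift of the one-sided spectrum
    _, y = istft(Zmod, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    return y

# Example call with purely illustrative parameter values:
# y = synthesize_emotional(x, fs, alpha=[1.3] * n_frames, beta=[2] * n_frames)
```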

V. CONCLUSIONS
In this paper, analysis and synthesis methods for emotional voice, intended for a natural man-machine interface, were developed. First, the emotional voice was analyzed using a time-frequency representation of speech and the similarity matrix. Then, based on the result of the emotional analysis, a voice with neutral emotion was transformed to synthesize a particular emotional voice using time-frequency modifications.

In the simulations, five types of emotion were analyzed using 50 samples of speech signals, and an average discrimination rate of 81.02% was achieved. Further, the synthesized emotional voice was evaluated subjectively. It was confirmed that the emotional voice was generated naturally by the proposed time-frequency based approach.
REFERENCES
[1] Y. Kitahara and Y. Tohkura, "Prosodic Control to Express Emotion for Man-Machine Speech Interaction", IEICE Trans., Vol. E75-A, No. 2, pp. 155-163, 1992.
[2] K. Hirose, N. Takahashi, H. Fujisaki and O. Sumio, "Representation of Intention and Emotion of Speakers with Fundamental Frequency Contours of Speech", Technical Report of IEICE, HC94-41, pp. 33-40, 1994-09.
[3] H. Kawanami and K. Hirose, "Considerations on the Prosodic Features of Utterances with Attitudes and Emotions", Technical Report of IEICE, SP97-67, pp. 73-80, 1997-11.
[4] Y. Tone, A. Ogihara and H. Shibata, "HMM Based Emotion Discrimination for Speech Dialogue System", Technical Report of IEICE, HC2000-22, pp. 47-53, 2000-06.
[5] T. Moriyama, H. Saito and S. Ozawa, "Evaluation of the Relationship between Emotional Concepts and Emotional Parameters on Speech", IEICE Trans., Vol. J82-D2, No. 4, pp. 703-711, 1999.
[6] M. Sigenaga, "Features of Emotionally Uttered Speech Revealed by Discriminant Analysis", IEICE Trans., Vol. J83-A, No. 6, pp. 726-735, 2000.
[7] S. Wada, H. Yagi and H. Inaba, "Effective Calculation of Finite Frame Operator for the Multiple Short-Time Fourier Transform", Proc. IEEE-SP Int. Symp. on Time-Frequency and Time-Scale Analysis, pp. 205-208, 1998.
[8] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, 1983.

