
Integrated System for Prosodic Features Detection from Speech

Marius Dan Zbancioc1,2
Institute of Computer Science, Romanian Academy
Technical University “Gheorghe Asachi” of Iasi
Iaşi, România
zmarius@etti.tuiasi.ro

Monica Feraru1
Institute of Computer Science, Romanian Academy
Iaşi, România
monica.feraru@iit.academiaromana-is.ro

Abstract - The paper presents the instruments implemented in the SROL (Voiced Sound of Romanian Language) project for the extraction of the fundamental frequency F0 and of the formants F1-F4. To improve the detection of the prosodic features, we combine four methods: autocorrelation, cepstral analysis, AMDF (average magnitude difference function) and HPS (harmonic product spectrum). The final value of the pitch (F0) is obtained by integrating the partial results of each method, weighted according to their performance. For formant detection we use a fuzzy method of spectrum concatenation.

Keywords - fundamental frequency detection, prosodic features, precision annotation

I. INTRODUCTION

The analysis of the vocal signal and the extraction of patterns in order to automatically recognize the speaker or the spoken language, to identify the expressed emotion, or to translate spoken words into text have been active research areas for several decades. The most important parameters extracted from the vocal signal are the prosodic features: the fundamental frequency F0 and the formants F1-F4. The trace of the fundamental frequency gives information about the pitch, the intonation of the speech and the accent at phrase and word level.

The vocal signal is non-stationary: its source (the glottis) has nonlinear dynamics, and the upper vocal tract changes continuously in time [1]. At present there is no sufficiently good mathematical framework for representing F0 and the formants, and for this reason researchers keep looking for better methods to extract the prosodic parameters of speech.

An F0 extractor depends on the context for which it was developed. For reliable detection with a software instrument it is better to apply several detection methods and to compare their results at the end. There are standard methods for pitch extraction. In the time domain we mention: Time-Event Rate Detection, Zero-Crossing Rate (ZCR) [2,3], Peak Rate, Slope Event Rate, Autocorrelation (the YIN estimator [4]), and Phase Space (Phase Space and Frequency, Phase Space of Pseudo-Periodic Signals). In the frequency domain the methods used to extract F0 are: Component Frequency Ratios [5], Filter-Based Methods (Optimum Comb Filter and Tunable IIR Filter), Cepstrum Analysis and Multi-Resolution Methods [6].

The pitch is influenced by the intensity, the duration, the emotional context and other parameters of the waveform [7]. In [8] the author proposes a new autocorrelation method for pitch extraction, which is useful in noisy environments and provides a moderate improvement over the conventional autocorrelation method.

The authors of [9] present a probabilistic error correction technique that reduces the pitch estimation error. Applied to the AMDF, it reduces the error rate from 6.07% to 3.29%. In [10], the proposed methods for fundamental frequency estimation are analyzed against the electroglottogram (EGG). Four methods were selected for the comparison (Praat's autocorrelation method, the cross-correlation method, subharmonic summation and the YIN estimator). The performance of the methods was studied with the signal degraded by white noise at different levels, and the methods proved very robust to noise. The experiments were carried out on the Keele and CSTR databases.

The remainder of this paper is structured as follows. Section II gives brief information on the annotation of the sound files from the SROL corpus. Section III discusses the algorithm modules of the integrated system for pitch detection. The results obtained in simulation, the comparisons with other instruments and the conclusions are presented in the last section.

II. SROL DATABASE – ANNOTATION FILES

The SROL project (Voiced Sound of Romanian Language) is a free web resource of sound files and instruments for speech processing. The sound collection currently contains more than 1500 files and is extended periodically.

The SROL corpus contains more than 1050 manually annotated files, which can be used to validate the methods implemented for pitch detection. The annotation was made with the Praat software, but it is difficult to achieve high precision because the boundaries of phonemes cannot be delimited only by listening to the sound files. The human ear can perceive the intra-speech pauses and the limits of phonemes only with a precision of tens of milliseconds.
TABLE I. ANNOTATED FILES FROM SROL CORPUS
Number of files   File type
63                Vowels in context
110               Diphthong, triphthong and hiatus files
177               Consonants in IPA standard (VCV format)
19                Double and simple subject in Romanian language
396               Emotional corpus (joy, fury, sadness, neutral tone)
90                Standard prosody (exclamatory, interrogative, neutral)
210               Gnathosonic and gnathophonic files

Ten files from four speakers (two female and two male) were annotated with high precision, of the order of milliseconds. The annotation used visual inspection of the signal in the time domain, searching for the areas with periodic signal, and in the frequency domain, where the spectral peaks offer information about F0. The ratio between the low and high frequency energy bands of the spectrogram also helps to determine the phoneme limits accurately. Even so, the C/V delimitation remains subjective, and there are regions where the annotation is difficult, for example the noises introduced by inhaled/exhaled air, the closing of the lips, and the transitions between phonemes.

These high precision annotation files were used to estimate the errors of our instrument and to compare them with the errors given by other instruments (Praat, Wasp).

During the annotation process, in addition to the classical methodology, we took the following into account:
- For the specific sounds of the Romanian language we used the notations: ‘â’=‘a-’, ‘ă’=‘a+’, ‘ş’=‘sh’, ‘ţ’=‘tz’.
- For the pause areas we used different markers according to their nature. Pauses between utterances in sentences were marked with blank characters, and pauses between syllables inside a word (intra-speech) with the ‘$’ symbol. A special category is the pauses present at the level of some consonants, such as ‘p’, ‘t’, ‘c’, which were marked with the ‘%’ symbol.
- The boundaries of each phoneme were established by auditory inspection and by visual inspection of the spectrogram and of the waveform, searching for the periodic signal areas corresponding to vowels and voiced consonants.
Each annotation was validated by several assessors.

III. FEATURE VECTORS EXTRACTION

To extract the pitch (fundamental frequency) and the formants accurately, we need a very good segmentation of the signal. The segmentation error from the preprocessing phase has the biggest influence on the final error, because it affects all the subsequent software modules.

In our recent research on emotion recognition, the feature vectors were extracted only from the regions which contain prosodic information. The areas of interest are the vowels and the voiced consonants. The Romanian language has seven vowels: ‘a’, ‘e’, ‘i’, ‘o’, ‘u’, ‘ă’, ‘î’. Some consonants, depending on the context, contain prosodic information (for example ‘l’, ‘m’, ‘n’, ‘r’, ‘v’, ‘z’) and are called voiced consonants. The segmentation removes only the unvoiced consonants and the pauses between utterances.

The functional blocks of the integrated system for prosodic feature extraction are shown in Figure 1.

[Figure 1: block diagram. Sound files (*.wav) -> Preprocessing (filtering; estimation of the segmentation parameters; C/V segmentation; predictive NN segmentation) -> Extraction of F0 (autocorrelation, AMDF, HPS and cepstral detectors, with weights w1, w2, w3, w4) -> Decisional block -> Extraction of formants F1,…,F4 (estimation of parameters for formant detection; smoothed spectrum method; fuzzy system for spectrum concatenation) -> Statistical processing of data]
Figure 1. Block diagram of the integrated system for pitch detection

The algorithm for the extraction of the fundamental frequency F0 and of the formants F1-F4 has three main phases. The first phase, signal preprocessing, consists of several operations which prepare the input signal for the detection of the prosodic features. The vocal signals were recorded with a sampling frequency of 16 kHz or 22 kHz and a resolution of 16 or 24 bits. If necessary, the input signal is resampled to a sampling frequency of 16 kHz. In the signal processing each frame (analysis window) has 1024 samples, that is 64 ms.

The main preprocessing steps are:

P1. Signal filtering: an HPF (high-pass filter) at 70 Hz removes from the input signal the influence of the 50 Hz electric network, and an LPF (low-pass filter) at 4500 Hz helps to isolate the frequency band corresponding to F0 and the formants.

P2. Estimation of the segmentation parameters: the parameters used to classify the unvoiced consonants, the vowels and the pauses are the zero crossing rate ZCR, the average energy in the time domain e_med, and the energies in the bands B1 [70-1000] Hz and B2 [1000-4500] Hz. In this step a decision tree (for example See5) can optionally be used to obtain a set of rules for separating the areas of interest. The threshold values may be established empirically after many simulations, or determined automatically from the set of rules generated by the decision tree.
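The segmentation parameters listed in step P2 can be illustrated with a short sketch. It is only a minimal example under stated assumptions: the paper does not give the exact definitions or normalizations, so the function names (`zero_crossing_rate`, `mean_energy`, `band_energies`), the sign convention for ZCR and the use of a direct DFT are hypothetical choices made here for illustration.

```python
import cmath
import math

FS = 16000                   # sampling frequency (Hz), from the text
BAND_B1 = (70.0, 1000.0)     # band B1 (Hz), from the text
BAND_B2 = (1000.0, 4500.0)   # band B2 (Hz), from the text


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ (assumed definition)."""
    changes = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return changes / (len(frame) - 1)


def mean_energy(frame):
    """Average time-domain energy e_med: mean of the squared samples (assumed definition)."""
    return sum(x * x for x in frame) / len(frame)


def band_energies(frame, fs=FS):
    """Spectral energy in B1 and B2, computed with a direct (slow) DFT."""
    n = len(frame)
    b1 = b2 = 0.0
    for m in range(1, n // 2):
        freq = m * fs / n
        if not (BAND_B1[0] <= freq < BAND_B2[1]):
            continue  # bin lies outside both bands
        x_m = sum(frame[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        power = abs(x_m) ** 2
        if freq < BAND_B1[1]:
            b1 += power
        else:
            b2 += power
    return b1, b2


# Two synthetic test frames: a vowel-like low tone and a high-frequency tone.
low = [math.sin(2 * math.pi * 200 * k / FS) for k in range(256)]
high = [math.sin(2 * math.pi * 3000 * k / FS) for k in range(256)]
for name, frame in (("200 Hz", low), ("3000 Hz", high)):
    b1, b2 = band_energies(frame)
    print(name, zero_crossing_rate(frame), mean_energy(frame), b1 / (b1 + b2))
```

The short 256-sample frames only keep the direct DFT fast in this example; the system described in the paper works on 1024-sample frames, where an FFT would be the natural choice.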
P3. Segmentation C/V: the method uses the parameters obtained in the previous step: ZCR, e_med, B1, B2. The rules applied in the segmentation use the previously established threshold values and have the following form:

If E_w > v1·E_max, where E_w is the time-domain energy of the current analysis window and E_max is the maximum energy over all the frames, then we consider that the frame contains speech.
If B1 > v2·(B1+B2), that is, the spectral energy in band B1 exceeds the fraction v2 of the entire spectral energy, then we consider that the frame contains a vowel.
The first rule helps to identify the pauses, and the second rule is based on the fact that for consonants the spectral energy is concentrated in the high frequency area.

P3. Segmentation with a predictive neural network (NN) [11]: this method is more robust. The network predicts (approximates) the signal better in the vowel areas, where the signal is quasi-periodic, than in the consonant areas, where the signal is noise-like.

The extraction of F0 uses four methods simultaneously. Two are based on time domain analysis: a) autocorrelation and b) the Average Magnitude Difference Function (AMDF). Two are based on spectral analysis: c) the Harmonic Product Spectrum (HPS) and d) cepstral analysis.

M1. Autocorrelation: provides information about the periodicity of the vocal signal. To obtain the correlation vector, we use the classical formula:

$$AC_{ss}[i] = \sum_{k=0}^{N-1-i} \hat{s}_w[k] \cdot \hat{s}_w[k+i] \qquad (1)$$

In the correlation vector we search for the dominant local maximum, excluding the first value AC_ss[0], which represents the energy of the signal. Because the F0 frequency band is [70, 500] Hz, the search for the local maximum associated with the fundamental period is limited to the lag range [fs/F0max, fs/F0min], that is [32, 229] samples (32 = 16000/500, 229 ≈ 16000/70).

M2. AMDF: the average magnitude difference function is similar to the autocorrelation function, but in the formula the multiplication is replaced by a subtraction, and the fundamental period T0 is given by a local minimum:

$$D_{ss}[i] = \sum_{k=0}^{N-1-i} \left| \hat{s}_w[k] - \hat{s}_w[k+i] \right|, \quad i \in [32, 229] \qquad (2)$$

M3. HPS: the Harmonic Product Spectrum consists in computing the spectrum of the signal with the Fast Fourier Transform (FFT), decimating the spectrum with the factors 1/2, 1/3, 1/4, ..., and multiplying the resulting signals. The spectrum of a periodic signal with fundamental frequency F0 has harmonics at the multiples of this frequency, 2·F0, 3·F0, 4·F0, ... The product of the decimated spectra reinforces the frequencies around F0, while the other spectral values are strongly attenuated.

M4. Cepstral method: the spectrum of a vocal signal is the product of two components. One is related to how the sound is generated by the vibrations of the vocal cords (H_g, with spectrum S_g(ω)) and provides information about F0. The other is related to how the voice signal is filtered by the resonator model (H_f, with spectrum S_f(ω)): oral and nasal cavities, lips and tongue movements. The cepstrum (the inverse Fourier transform of the logarithm of the estimated spectrum) transforms the multiplication of the two spectral components into an addition, and because these components are separable, F0 can be extracted.

$$C = \mathcal{F}^{-1}\left(\log\left(S_g(\omega) \cdot S_f(\omega)\right)\right) = \mathcal{F}^{-1}\left(\log S_g(\omega)\right) + \mathcal{F}^{-1}\left(\log S_f(\omega)\right) \qquad (3)$$

The decisional block weights the results of the fundamental frequency extraction methods. In a first phase, a correction method searches the F0 trace for consecutive values that differ by more than 20%, in either direction. The correction algorithm is applied iteratively until all the jump values are removed. The final output is computed as a weighted sum of the partial results of each method, according to their performance: a method with better performance in pitch detection receives a higher weight than a less accurate one.

After simulation and comparison of the outputs with the high precision annotation files, the most efficient methods were the autocorrelation and the cepstral method, with associated weights of 55% and 35% respectively. When the outputs of the autocorrelation and cepstral methods are significantly different, the other two methods, HPS and AMDF (which are less accurate), may help in the final decision. The weight for HPS is 5% and for AMDF is 5%.

For the detection of the formants F1-F4, the interval in which each formant is searched depends on the value of F0. For a male voice, with a small value of F0, the intervals are different from those for a female or child voice, where F0 is significantly higher.

The smoothed spectrum method extracts a spectral “envelope” by filtering the cepstrum of the signal. Only the first L values of the cepstral signal are kept (the others are set to zero), and the inverse of the transformation in formula (3) is applied. The resulting spectral signal has slow transitions because the “high frequency component” was deleted.

$$C^*[k] = \begin{cases} C[k], & k \le L \ \text{or} \ k > N_w - L \\ 0, & \text{otherwise} \end{cases}$$

$$S^*(\omega) = \exp\left(\mathrm{FFT}(C^*)\right) = e^{\mathcal{F}(C^*)} \qquad (4)$$

The degree of smoothing depends on the number of samples L kept from the cepstral signal. In some cases, for small values of L, some formants are difficult to find or do not appear in the spectral envelope. For this reason it was necessary to compute three envelopes: one for F0, obtained with a small value of L, and two others, for the formants F1-F2 and F3-F4 respectively, obtained with higher values of L. Different sections of these envelopes are finally concatenated. The detailed algorithm is presented in [12].
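As an illustration of the F0 branch described in this section, the lag search of equation (1) and the weighted sum of the decisional block can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the search takes the global maximum over the lag range instead of the "dominant local maximum", the 20% jump correction is omitted, and the cepstral, HPS and AMDF estimates passed to `fuse` are made-up placeholder numbers.

```python
import math

FS = 16000                         # sampling frequency (Hz), from the text
F0_MIN, F0_MAX = 70, 500           # F0 search band (Hz), from the text
LAG_MIN = FS // F0_MAX             # 32 samples
LAG_MAX = math.ceil(FS / F0_MIN)   # 229 samples

# Weights reported in the text for the decisional block.
WEIGHTS = {"autocorrelation": 0.55, "cepstral": 0.35, "hps": 0.05, "amdf": 0.05}


def autocorrelation(frame, lag):
    """Equation (1): sum of frame[k] * frame[k + lag] over the valid k."""
    return sum(frame[k] * frame[k + lag] for k in range(len(frame) - lag))


def f0_autocorrelation(frame):
    """F0 (Hz) from the lag in [LAG_MIN, LAG_MAX] with the largest autocorrelation.

    Simplification: the global maximum in the range is used.
    """
    best_lag = max(range(LAG_MIN, LAG_MAX + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return FS / best_lag


def fuse(estimates):
    """Weighted sum of the per-method F0 estimates (Hz) for one frame."""
    return sum(WEIGHTS[name] * estimates[name] for name in WEIGHTS)


# Synthetic 200 Hz tone, one frame of 1024 samples (64 ms).
frame = [math.sin(2 * math.pi * 200 * k / FS) for k in range(1024)]
f0_ac = f0_autocorrelation(frame)

# Placeholder values for the other detectors, only to exercise fuse().
f0 = fuse({"autocorrelation": f0_ac, "cepstral": 198.0, "hps": 205.0, "amdf": 190.0})
print(round(f0_ac, 1), round(f0, 1))
```

Because the weights sum to 1, the fused value stays on the same scale as the individual estimates, and the two low-weight detectors can shift it only slightly.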
IV. RESULTS AND CONCLUSIONS

The instrument for prosodic feature detection was run on several hundred sound files. We compared the performance of our integrated system with the free software instruments available on the Internet (Praat and Wasp). On these files our results are better than those of the other instruments, which we attribute to the validation of the final values by several detection methods. Figures 2-4 highlight the sections where each software gives errors in the pitch detection.

Figure 2. Pitch detection using our instruments

Figure 3. Pitch detection using WASP software

Figure 4. Pitch detection using Praat software

Table II presents the errors in F0 detection obtained by all four methods, before the error correction is applied. We note that the annotation files used here have low accuracy at the boundaries of the phonemes. The weights associated with each F0 detector were established according to these errors.

TABLE II. F0 DETECTION METHODS ERRORS
Average error           Autocorrelation   Cepstral   AMDF     HPS
Set of 111 utterances   4.8 %             11.8 %     35.2 %   30.2 %
Set of 63 utterances    7.05 %            16.6 %     34.2 %   41.3 %

TABLE III. COMPARISON BETWEEN PRAAT AND OUR INSTRUMENT
Input files                       Our instrument      Praat
                                  E1       E2         E1       E2
High precision annotation files   1.269    0.384      1.565    0.6033

Table III reports the errors E1 (false positives: F0 is found where the annotation does not mark a vocalic sound) and E2 (false values of F0: jumps between consecutive samples).

One conclusion is that the autocorrelation and the cepstral methods give better results in pitch detection than the HPS and AMDF methods. The latter two detectors help when the outputs of the main detectors differ. Praat is the standard utility for researchers in speech analysis. Compared with Praat, our results on the SROL sound files with high accuracy annotation were better. An explanation is that we use several methods for F0 detection, apply a correction to the outputs and integrate the partial results.

In the future we intend to improve the detection by introducing supplementary restrictions in the correction procedure and by replacing HPS and AMDF with more efficient methods.

ACKNOWLEDGMENT

We acknowledge the Romanian Academy priority research grant “The cognitive systems and applications”, the entire research team, and especially Professor H.N. Teodorescu, the coordinator of the project “Sounds of Romanian language”.

REFERENCES

[1] W. Rodriguez, H.-N. Teodorescu, F. Grigoras, A. Kandel, and H. Bunke, “A Fuzzy Information Space Approach to Speech Signal Non-Linear Analysis”, International Journal of Intelligent Systems, vol. 15, no. 4, pp. 343–363, 2000.
[2] B. Kedem, “Spectral analysis and discrimination by zero-crossings”, Proceedings of the IEEE, vol. 74, no. 11, pp. 1477–1493, 1986.
[3] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator”, in Int. Conf. on Acoustics, Speech and Signal Processing, vol. II, pp. 1331–1334, IEEE, 1997.
[4] A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music”, Journal of the Acoustical Society of America, vol. 111, no. 4, 2002.
[5] M. Piszczalski and B. Galler, “Predicting musical pitch from component frequency ratios”, Journal of the Acoustical Society of America, vol. 66, no. 3, pp. 710–720, 1979.
[6] E. Geoffrois, “The multi-lag-window method for robust extended-range F0 determination”, in Fourth Int. Conf. on Spoken Language Processing, vol. 4, pp. 2239–2243, 1996.
[7] D. Gerhard, “Pitch Extraction and Fundamental Frequency: History and Current Techniques”, Technical Report TR-CS 2003, ISBN 0 7731 0455 0, pp. 1–23, 2003.
[8] T. Shimamura, “Weighted Autocorrelation for Pitch Extraction of Noisy Speech”, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 727–730, 2001.
[9] G. Ying, L. Jamieson, and C. Mitchell, “A probabilistic approach to AMDF pitch detection”, ICSLP, ISBN 0-7803-3555-4, vol. 2, pp. 1201–1204, 1996.
[10] K. Sri Rama Murty and B. Yegnanarayana, “Event-based instantaneous fundamental frequency estimation from speech signals”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 614–624, 2009.
[11] M. Zbancioc and M. Feraru, “The automatic segmentation of the vocal signal using predictive neural network”, Int. Symposium on Signals, Circuits, and Systems - ISSCS, ISBN 978-1-4799-3193-4, pp. 1–4, 2013.
[12] M. Zbancioc and H.-N. Teodorescu, “Fuzzy methods for automatic extraction of speech formants”, The 3rd International Conference on Telecommunication, Electronics and Informatics - ICTEI2010, May 20-23, Chişinău, Republica Moldova, Proceedings vol. II, ISBN 978-9975-45-136-9, pp. 57–62, 2010.