
Voice Source and Duration Modelling

for Voice Conversion and Speech


Repair
Arantza del Pozo
Christ's College
and
Cambridge University Engineering Department
April 2008
Dissertation submitted to the University of Cambridge
for the degree of Doctor of Philosophy
Declaration
This thesis is the result of my own work carried out at the Cambridge
University Engineering Department and includes nothing which is the out-
come of work done in collaboration except where specifically indicated in
the text. Some material has been previously presented at international
conferences [21] [22].
The length of this dissertation, including appendices, bibliography,
footnotes, tables and equations is approximately 38,000 words. This dissertation contains 47 figures.
Abstract
Voice Source and Duration Modelling for Voice Conversion and Speech Repair
Arantza del Pozo
Voice Conversion aims at transforming a source speaker's speech to sound like that of a
different target speaker. Text-to-speech synthesisers, dialogue systems and speech repair are
among the numerous applications which can greatly benefit from the development of voice
conversion technology. Whilst state-of-the-art implementations are capable of achieving reasonable
conversions between speakers with similar voice characteristics and prosodic patterns, they do
not work as well in scenarios where the differences between the source and the target speech
are more extreme. This is mainly due to limitations in the modelling and conversion of the voice
source and prosody. In this thesis, a refined modelling and transformation of the voice source
and duration is proposed to increase the robustness of voice conversion systems in extreme
applications. In addition, the developed techniques are tested in a speech repair framework.
Voice source modelling refinement involves using the Liljencrants-Fant model instead of the
linear prediction residuals employed by the existing implementations to represent the voice
source. A speech model has been developed which automatically estimates voice source and
vocal tract filter parameterisations. The use of this speech modelling technique for the analysis,
modification and synthesis of speech allows the application of linear transformations to convert
voice source parameters. The performance of the developed conversion system has been shown
to be comparable to that of state-of-the-art implementations in terms of speaker identity, but to
produce converted speech with a better quality. Regarding duration, a decision tree approach
is proposed to convert duration contours. Its application has been shown to reduce the mean
square error distance between the converted and target duration patterns and to increase their
correlation.
The developed speech model and duration conversion techniques are then tested in an ex-
treme application: the repair of the voice source and duration limitations of tracheoesophageal
speech. Tracheoesophageal voice source repair involves the replacement of the glottal source,
smoothing of jitter and shimmer, reduction of the aspiration noise component and raising of the
fundamental frequency in some cases. As for duration, decision trees trained on normal data
are employed to repair the tracheoesophageal duration contours. The performance of the re-
pair algorithms has been found to be highly dependent on the quality of the tracheoesophageal
speakers. Whilst the repaired speech has been found to be less deviant, ugly and unpleasant
to listen to overall, its naturalness, intelligibility and rhythm is still relatively poor compared to
that achieved for normal speakers.
Acknowledgements
Many people have made this work possible. First, I would like to sincerely thank my supervi-
sor Steve Young for his guidance, encouragement and support throughout my time as a PhD
student. For allowing me to explore the challenging world of tracheoesophageal speech repair
and for always finding the time to discuss the difficulties along the way. For all his constructive
suggestions and advice.
Thanks also to KK Hui Ye for providing me with the PSHM and High-Quality Voice Morphing
implementations, and to Sarah West from the Speech Therapy Department at Addenbrooke's
Hospital for organising the tracheoesophageal speech recording sessions. Thanks to the tra-
cheoesophageal speakers who volunteered for the recordings and to anyone who took part in
the numerous listening tests carried out as part of this work.
I would also like to thank all the colleagues from the Machine Intelligence Lab - particularly
Steve's fellow students Zeynep Inanoglu and Jost Schatzmann - for all the support, fruitful dis-
cussions and fun times together. Thanks also go to Patrick Gosling and Anna Langley for their
efficient IT help throughout these years. And to Janet Milne for her loving admin support.
I deeply appreciate the financial support I have received from the Basque Government under
the researcher training grant programme. Thanks to Javier Urrutia for helping me apply in the
first place. I am also grateful for the additional support from the Department of Engineering, and
from Christ's College.
Special thanks to all the Cambridge friends who have accompanied me along the process.
Thanks to Thet Su Win, Sirichai Chongchitnan and Gian Paolo Procopio for always being there.
To the Italian mafias I and II for all the parties and Cavendish lunches. To the Cambridge Basques
(Cambridge-ko Euskaldunei) for the Basque outings. To the CUCDW dancers for helping me keep up with con-
temporary dance. To Matti Airas for his Skype support during the writing-up. And to Barbara,
Arnaud and Giorgio Gac-Moles for taking care of me towards the end.
Finally, but most importantly, I wish to thank my parents Arantxa Etxezarreta and Manuel
del Pozo, and my sister Maitane for their unconditional support, encouragement, positive energy
and love. To them I dedicate this thesis.
Contents
Table of Contents v
List of Figures vii
List of Tables x
1 Introduction 1
1.1 Motivation 1
1.2 Overview and Limitations of Voice Conversion 2
1.3 Summary of Proposed Approach 4
1.4 Outline 4
2 Speech Modelling and Feature Transformations in VC 6
2.1 Speech Modelling 6
2.1.1 Acoustic Theory of Speech Production 6
2.1.2 The Source-Filter Model 10
2.1.2.1 Linear Prediction 11
2.1.3 Sinusoidal Models 13
2.1.3.1 The McAulay-Quatieri Model 13
2.1.3.2 The ABS/OLA Model 13
2.1.3.3 The HNM Model 14
2.1.3.4 Modification of Speaker Identifying Features 15
2.2 Feature Transformations 17
2.2.1 Spectral Envelope Conversion 17
2.2.2 Voice Source Conversion 21
2.2.3 Prosody Conversion 23
2.3 Evaluation 24
2.4 Limitations of the current approaches 25
3 Joint Estimation Analysis Synthesis 27
3.1 JEAS Model 27
3.1.1 Modelling the Voice Source: LF model 27
3.1.2 Source-Filter deconvolution 31
3.1.2.1 Inverse Filtering and Glottal Parameterisation 32
3.1.2.2 Joint Source-Filter Estimation 35
3.2 JEAS Analysis 38
3.2.1 Voicing Decision and GCI Detection 38
3.2.2 Adaptive Joint Estimation 38
3.2.2.1 Encoding the Spectral Tilt 41
3.2.3 LF Fitting 43
3.2.4 Modelling Aspiration Noise 44
3.3 JEAS Synthesis 49
3.4 Pitch and Time-Scale Modification 52
4 JEAS Voice Source and CART Duration Modelling for VC 54
4.1 Sinusoidal Voice Conversion 54
4.1.1 PSHM 54
4.1.2 Spectral Envelope Conversion 55
4.1.3 Spectral Residual Prediction 56
4.1.4 Phase Prediction 57
4.2 JEAS Voice Conversion 57
4.2.1 Spectral Envelope Conversion 58
4.2.2 Glottal Waveform Conversion 58
4.3 PSHM vs. JEAS Voice Conversion 61
4.3.1 Objective Evaluation 61
4.3.1.1 Spectral Envelope Conversion 61
4.3.1.2 Voice Source Conversion 62
4.3.2 Subjective Evaluation 64
4.4 Duration Conversion 65
5 Application: Tracheoesophageal Speech Repair 70
5.1 Laryngectomy 70
5.1.1 Speech Production after Laryngectomy 71
5.2 Limitations of TE Speech 73
5.3 Speech Corpora 74
5.4 Voice Source Repair 76
5.4.1 Glottal Replacement 76
5.4.2 Jitter and Shimmer Reduction 82
5.4.3 F0 Raise 82
5.4.4 Perceptual effect 83
5.5 Duration Repair 86
5.5.1 Tracheoesophageal phone recognition 86
5.5.2 Duration Prediction 90
5.5.3 Robust duration modification 91
5.6 Evaluation 94
5.6.1 CPCLP vs. JEAS glottal replacement 95
5.6.2 Evaluating Voice Source Repair 95
5.6.3 Evaluating Duration Repair 97
5.6.4 Ranking Quality 98
5.7 Discussion 99
6 Conclusions and Future Work 101
6.1 Conclusions 101
6.1.1 Joint Estimation Analysis Synthesis 101
6.1.2 JEAS Voice Source and CART Duration Modelling for VC 102
6.1.3 Tracheoesophageal Speech Repair 103
6.2 Future Work 103
A Recorded stimuli 113
List of Figures
2.1 Human Speech Production Apparatus 7
2.2 Glottal Waveform and Frequency Spectrum 8
2.3 Acoustic Theory of Speech Production 10
2.4 Linear Prediction Model 12
2.5 Properties of the Linear Prediction Residual: a) speech frame, b) LP residual, c)
speech spectrum and LP spectral envelope, d) LP residual spectrum 12
2.6 Frequency warping example 18
2.7 Codebook Mapping 20
2.8 Weighted Codebook Mapping 20
2.9 Post-filtering example 21
3.1 JEAS Model 28
3.2 Modelling the Glottal Wave 29
3.3 Modelling the Derivative Glottal Wave 29
3.4 LF model 30
3.5 Inverse Filtering 33
3.6 Iterative Adaptive Inverse Filtering 33
3.7 Joint Estimation Analysis Model 39
3.8 Joint Estimation example: a) speech period, b) speech spectrum and jointly es-
timated spectral envelope, c) inverse filtered residual and jointly estimated RK
wave 41
3.9 RK derivative glottal wave 42
3.10 Spectral Tilt modelling in Lu and Smith's formulation 42
3.11 Effect of adaptive pre-emphasis: a) Speech spectrum (S) and estimated spectral
envelope (SE), b) IF derivative glottal wave (IF dgw) and fitted LF waveform
(fitted dgw), c) IF derivative glottal wave spectrum (IF dgw) and fitted LF wave
spectrum (fitted dgw) 43
3.12 LF fitting examples: a) normal, b) breathy and c) pressed IF derivative glottal
waves (dgw) and fitted LF waveforms (fitted dgw) 44
3.13 Denoising example: a) original and denoised IF derivative glottal wave, b) noise
estimate 45
3.14 Standard Aspiration Noise Model Parameters 46
3.15 Gaussian Noise modulation by a derivative glottal LF waveform: a)Gaussian Noise
source, b)derivative glottal LF waveform, c)LF Modulated Gaussian Noise 47
3.16 Aspiration Noise Model 47
3.17 Overlap-Add Synthesis 50
3.18 Resampling the frame size contour 52
4.1 JEAS vs. PSHM spectral envelopes 58
4.2 Linear Transformation of LF Glottal Waveforms: a) source, target and converted
derivative glottal LF waves; b) source, target and converted trajectories of the
glottal feature vector parameters ($E_e$, $R_g$, $R_k$, $R_a$, $N_e$) 59
4.3 $R_{LSF}$ distortion ratios of the converted PSHM and JEAS spectral envelopes 62
4.4 $R_{LSD}$ distortion ratios of Residual Predicted (RP) and Glottal Waveform Converted (GWC) spectra 63
4.5 Results of the ABX test 64
4.6 Results of the quality comparison test 65
4.7 An example partial CART decision tree for phone duration prediction 67
4.8 $R_{MSE}$ duration distortion ratios 69
4.9 Correlation of source-target (ST) and converted-target (T1,T2,T3) duration contours 69
5.1 Speech production apparatus before and after laryngectomy 71
5.2 Enhancement examples: a) LSF smoothing: original (crosses) and smoothed
(continuous) trajectories of first four LSF coefficients b) Tilt reduction: original
(continuous) and reduced (dotted) spectral envelopes 79
5.3 Normal and TE JEAS derivative glottal wave estimates: a) estimated derivative
glottal waveforms, b) estimated derivative glottal waveforms (IF dgw) and fitted
LF waves (fitted dgw), c) aspiration noise estimates 79
5.4 Original TE speech 81
5.5 Repaired TE speech 81
5.6 Fundamental Period and Energy Contour Smoothing 83
5.7 Force-Aligned (FA) and Recognised (REC) Segmentations and Labels (continuous
if correct, discontinuous if wrong) 89
5.8 CPCLP vs. JEAS Glottal Replacement Evaluation 95
5.9 Voice Source Repair Evaluation 96
5.10 Duration Repair Evaluation 98
5.11 Mean Opinion Naturalness, Intelligibility and Rhythm Scores of repaired TE speech 99
List of Tables
3.1 Joint Estimation Methods [where $s$ and $\hat{s}$ are the true and predicted speech
signals, $g$ and $\hat{g}$ are the true and predicted glottal waveforms, O and C are the
open and closed phases and LF and HF the low and high frequencies respectively]
(Refer to the List of Acronyms) 37
5.1 Sociodemographic and clinical information of the TE speakers 75
5.2 Average Male TE and Normal F0 [Hz] 84
5.3 Average Female TE and Normal F0 [Hz] 84
5.4 The recognisers tested and their corresponding features 89
5.5 Evaluation of Recogniser Performance 90
5.6 Description of trees and corresponding features 91
5.7 Evaluation of tree duration prediction 91
5.8 Robust modication systems 92
5.9 Evaluation of Repair Systems 92
List of Acronyms
ABS/OLA Analysis-By-Synthesis OverLap Add
AJE Adaptive Joint Estimation
AM Aerodynamic-Myoelastic
ARX Auto-Regressive model with an eXogenous input
CART Classication And Regression Tree
CMLLR Constrained Maximum Likelihood Linear Regression
CMN Cepstral Mean Normalisation
CPCLP Closed-Phase Covariance Linear Prediction
DAP Discrete All-Pole
EGG Electroglottography
GCI Glottal Closure Instants
GMM Gaussian Mixture Model
HMM Hidden Markov Model
HNM Harmonic plus Noise Model
IAIF Iterative Adaptive Inverse Filtering
IF Inverse Filtering
JEAS Joint Estimation Analysis Synthesis
LF Liljencrants-Fant
LM Language Model
LP Linear Prediction
LSE Least Square Error
LSF Line Spectral Frequencies
MOS Mean Opinion Score
MSE Mean Square Error
PSHM Pitch-Synchronous Harmonic Model
PSOLA Pitch-Synchronous Overlap Add
RK Rosenberg-Klatt
RMS Root Mean Square
SA Simulated Annealing
SPC Segmentation and Prediction Correctness
SPE Segmentation and Prediction Error
TE Tracheoesophageal
TTS Text-to-Speech
VC Voice Conversion
VQ Vector Quantization
1
Introduction
1.1 Motivation
Voice conversion (VC) is the process of modifying a source speaker's speech to make it sound
like that of a different target speaker. Due to its wide range of applications, there has been a
considerable amount of research effort directed at this problem in the last few years.
As an end in itself, it has use in many anonymity and entertainment applications. For ex-
ample, voice conversion can be used in the film dubbing industry and/or automatic translation
systems to maintain the identity of the original speakers when translating from the original lan-
guage to another. It can also be applied to transform an ordinary voice singing karaoke into a
famous singer's voice. Or to mask the identity of a speaker who wants to remain anonymous on
the telephone. Computer aided language learning systems can also benefit from voice conver-
sion, by using converted utterances as feedback for the learner.
Another important application is the customization of text-to-speech (TTS) systems. Typi-
cally, unit selection and concatenation TTS synthesis requires the recording of large speech cor-
pora by professional speakers. Because of the high cost involved in recording a separate database
for each new speaker, commercial text-to-speech implementations only generate speech by a few
speakers. Voice conversion can be exploited to economically synthesise new voices from previ-
ously recorded and already available databases.
In dialogue systems, voice conversion technology can be used to adapt speech outputs to dif-
ferent situations and make man-machine interactions more natural. Systems can mimic human-
human communication using emotion conversion techniques to transmit extra information to
the user on how the dialogue is going by generating confident, doubtful, neutral, happy, sad or
angry utterances for example. They can also modify the focus to indicate more precisely the
informational item in question.
In addition, the knowledge derived from the parameterisation and modication of speech
parameters for voice conversion can also be exploited in disordered voice repair applications.
Voice conversion algorithms can be used to convert deviant parameters and regenerate speech
with more natural quality and higher intelligibility. The development of speech repair appli-
1
CHAPTER 1. INTRODUCTION 2
cations can have a big social impact if oriented to help improve the quality of life of people
suffering speech disabilities.
Finally, as an attempt to separate speaker identity and message, voice conversion can be very
useful in core technologies such as speech coding, synthesis and recognition to achieve very low-
bandwidth speech coding, lead to more accurate speech models and improve the performance
of state-of-the-art speech recognisers.
1.2 Overview and Limitations of Voice Conversion
Voice conversion systems need to be capable of transforming the cues which cause the source
and target speech to be perceived as different. Speaker characteristics contained in the acoustic
speech signal can be classied into segmental and suprasegmental cues, both of which have
been shown to be perceptually relevant for speaker identification:
Segmental cues depend on the physiological and physical properties of the speech organs,
describe the timbre of a speaker's voice and shape the speech spectrum. They are de-
termined by the quality of the voice source and the shape of the vocal tract, generally
encoded by the glottal waveform and spectral envelopes, respectively.
Suprasegmental features are, on the contrary, influenced by psychological and social fac-
tors and describe the prosodic features related to the style of speaking. They are mainly
encoded in pitch, duration and energy contours.
In order to achieve full conversion, voice source, vocal tract and prosodic characteristics
should be transformed. However, in practice the particular features for conversion vary depend-
ing on the application. For identity conversion for example, spectral envelopes and average
pitch and duration, which have been shown to provide a high degree of speaker discrimination
by humans, are mostly transformed. This approach has been shown to give good results when
speakers with similar characteristics and prosodic patterns are morphed. In emotion conversion,
on the other hand, it is mainly pitch and duration contours which make a difference, with voice
source quality and spectral envelope features also being relevant for some emotions. For speech
repair applications, the features which make the disordered speech sound deviant need to be
identied rst. Average pitch, glottal waveforms and duration contours of tracheoesophageal
speech, or formant frequencies and duration of dysarthric speech, for example, are deviant
features which previous work has attempted to repair.
Different voice conversion systems employ different methods to achieve successful conver-
sions, but they all share speech modelling, training and transformation components.
A speech model is a mathematical representation of the speech signal that makes its analy-
sis, manipulation and transformation possible. For the particular case of voice conversion,
it needs to allow an artifact-free modification of the speaker characteristic features to be
converted, i.e. voice source, vocal tract, pitch, duration and energy. The Source-Filter and
Sinusoidal models are the most widely used in voice conversion frameworks.
The Source-Filter Model [28] describes the acoustic properties of speech production. In
this model, a source or excitation waveform is input to a time-varying filter. Two ele-
mental source types are generally modelled: voiced and unvoiced excitation. The glottal
waveform acts as the source excitation produced by the vibration of the vocal folds in
voiced speech, while white noise is used to describe unvoiced excitation. The time-varying
filter represents the vocal tract shape, which selectively boosts/attenuates certain frequen-
cies of the excitation spectrum depending on the location and position of the tongue, jaw,
lips and velum. It can be modelled as an all-pole filter with a relatively low number of
parameters (typically between 10 and 30) estimated using linear prediction techniques.
This fact makes the Source-Filter Model very attractive for the conversion of the spectral
characteristics of the speech wave.
Sinusoidal models [66, 39], on the other hand, approximate the input speech signal as
a sum of a number of sinusoids with time-varying amplitudes, frequencies and phases.
Despite the higher dimensionality of their spectral representation, they have been shown
to be capable of synthesising speech which is perceptually almost indistinguishable from
the original. For this reason, they are often preferred in applications requiring high-quality
speech resynthesis such as voice conversion.
In order to allow modication of speaker identifying features, Sinusoidal Models exploit
the more flexible Source-Filter concept to carry out pitch and time-scale, spectral envelope
or voice source transformations. A more detailed description of these models is presented
in chapter 2.
During training, the system estimates a transformation function between the source and
the target parameter spaces to be converted. Generally, standard machine learning meth-
ods such as Vector Quantization (VQ), Hidden Markov Models (HMM), Gaussian Mixture
Models (GMM), Decision Trees, Codebook-Mapping and Unit Selection approaches are
used. Advantages and disadvantages of such methods for the transformation of the differ-
ent features are discussed in chapter 2.
Finally, in transformation mode, the transformation function obtained in the training phase
is used to predict target speech features from source features. These predicted features
allow the converted speech to be synthesised using the chosen analysis/synthesis speech
model.
The main limitations of state-of-the-art voice conversion systems lie in two specic areas: the
modelling and conversion of the voice source and the modelling and conversion of prosody. To
date, most of the work on voice conversion has been focused on the transformation of spectral
CHAPTER 1. INTRODUCTION 4
envelopes. As a result, state-of-the-art voice conversion systems are capable of achieving suc-
cessful spectral envelope conversions between source and target speakers. On the contrary, little
work has been done on the transformation of the other speaker identifying features, i.e. voice
source, pitch, duration and energy. However, both source and prosodic characteristics have been
shown to be important and to improve conversion performance.
The main problem with voice source and prosody conversions is the lack of an adequate and
automatic method for their parameterisation. For that reason, alternative techniques have been
developed in an attempt to incorporate target source and prosodic information into spectral en-
velope conversions. Methods related to source transformation generally exploit the correlation
between spectral envelopes and linear prediction residuals to predict or select the best residual
candidate from databases built during training. Regarding prosody, most often the average
source fundamental frequency is simply scaled to match that of the target.
These methods perform reasonably well when transforming speakers with similar voice char-
acteristics and using similar prosodic patterns. However, they are not very robust and do not
work as well in more demanding applications such as conversion of speakers with very different
accents, emotional conversion or impaired voice repair.
1.3 Summary of Proposed Approach
In this work, a refined modelling and transformation of the voice source and duration is proposed
to increase the robustness of voice conversion systems in extreme applications. The developed
speech modelling and transformation techniques have been shown to be applicable even to
repair disordered speech. More precisely, the contributions of this thesis can be summarised as
follows:
The development of a speech analysis-modification-synthesis model capable of automati-
cally estimating joint voice source and vocal tract parameterisations, producing speech al-
most indistinguishable from the original and supporting high-quality pitch and time-scale
transformations.
A novel voice source transformation method which allows the conversion of source quality.
A refined duration modelling and conversion approach based on decision trees trained on
text-based features.
Demonstrating the validity of the proposed techniques in extreme applications, such as
the repair of tracheoesophageal speech.
1.4 Outline
The remainder of the thesis is organised as follows:
Chapter 2 overviews the speech models and techniques employed to convert the most relevant
speaker identifying features in state-of-the-art voice conversion frameworks.
Chapter 3 presents the Joint Estimation Analysis Synthesis (JEAS) model developed for the
analysis, transformation and synthesis of speech which automatically and simultaneously
parameterises both the vocal tract and the voice source, and can also perform high-quality
prosodic modications.
Chapter 4 investigates the use of the JEAS glottal source parameterisation and linear transfor-
mations for voice source conversion and evaluates its performance against that of a state-
of-the-art voice conversion system based on sinusoidal modelling and residual prediction.
In addition, a decision tree approach is proposed for duration conversion.
Chapter 5 presents the explored tracheoesophageal speech repair approach, which applies the
JEAS modelling and Classification And Regression Tree (CART) duration transformation
techniques developed for voice conversion to repair the two most affected characteristics
of tracheoesophageal voices, i.e. voice source and duration.
Chapter 6 concludes with a summary of the work presented in this dissertation and suggests
directions for future research.
2
Speech Modelling and Feature Transformations in VC
In this chapter, the speech models and transformation techniques employed in voice conversion
systems are reviewed. The Acoustic Theory of Speech Production is introduced first, as a back-
ground for the description of the Source-Filter Model and Linear Prediction. Three of the most
popular Sinusoidal Models, i.e. the McAulay-Quatieri, ABS/OLA and HNM models are then pre-
sented together with the methods they employ to modify the most relevant speaker identifying
features. Next, the techniques most commonly used to convert spectral envelope, voice source
and prosodic features, and the measures generally employed to evaluate their performance are
described. The limitations of the current speech models and conversion methods are finally
discussed in the last section.
2.1 Speech Modelling
Two main types of models exist which allow the mathematical representation and manipulation
of the speech wave: Source-Filter and Sinusoidal Models. While the source-filter description
attempts to replicate the way speech is acoustically produced, sinusoidal representations de-
compose the speech signal into sequences of time-varying sinusoids. Despite their differences,
both models have advantages which make them attractive for voice conversion applications. For
this reason, most voice conversion frameworks employ a combination of Sinusoidal and Source-
Filter Modelling to analyse, modify and synthesise speech. The following sections introduce the
process of speech production and describe the basic principles, strengths and weaknesses of both
the Source-Filter and Sinusoidal Models, in the context of voice conversion.
2.1.1 Acoustic Theory of Speech Production
The organs involved in the production of the speech wave comprise:
Lungs: source of air during speech
Trachea: windpipe
Larynx: organ of voice production
Pharyngeal cavity and Oral or buccal cavity: correspond to the throat and mouth respec-
tively and are often grouped into one unit referred to as the vocal tract
Nasal cavity: nose, often called nasal tract
Articulators: finer anatomical components which move to different positions to produce
the different speech sounds. The main articulators are the vocal cords, soft palate or
velum, tongue, teeth, lips and jaw.
Figure 2.1 Human Speech Production Apparatus
Speech production can be explained in terms of an acoustic filtering operation [28] where
the main cavities of the speech production apparatus shown in Figure 2.1 form the acoustic
filter. This filter is excited by the organs below it and loaded at its output by the radiation of the
lips.
In such a model, two elemental source types are mainly described and modelled: voiced and
unvoiced excitation, which classify speech sounds into voiced and unvoiced respectively.
Voiced excitation: the larynx is the main organ responsible for the production of the filter
excitation during voiced speech. Formed by the first two or three tracheal rings, the
epiglottis and the vocal cords, it is connected through the trachea with the lungs and
through the pharynx with the vocal and nasal tracts. The opening between the vocal folds
is often referred to as the glottis.
In order to produce the voiced excitation, the vocal folds vibrate periodically interrupting
the subglottal airflow passing through the glottis, generating quasi-periodic puffs of air
which excite the vocal tract.
The vocal cord oscillation process occurs as follows [98]. Initially, the vocal folds are closed
and the air stream from the lungs builds up pushing the vocal folds apart. Eventually, the
pressure reaches a level sufficient to force the vocal folds to open and thus allow air to
flow through the glottis. Then, by the Bernoulli principle, the pressure in the glottis falls
allowing the cords to come together again, and so the cycle is repeated.
The general shape of the resulting voiced excitation, known as the glottal waveform, is
illustrated in Figure 2.2. The closed phase of the oscillation occurs when the glottis is
closed and the glottal volume velocity flow is zero. The open phase, on the other hand,
is characterised by a non-zero volume velocity, in which the lungs and the vocal tract are
coupled. Its spectral shape can be described as having a resonant peak in the low-frequency
region referred to as the glottal formant, and a monotonically decreasing spectral slope
thereafter.
Figure 2.2 Glottal Waveform and Frequency Spectrum
The time between successive vocal fold openings is called the fundamental period $T_0$, while
the rate of vibration is called the fundamental frequency of the phonation, $F_0 = 1/T_0$, and
pitch corresponds to the perceived fundamental frequency. Although technically they refer
to the physical and perceptual aspects, respectively, the terms fundamental frequency and
pitch will be used interchangeably throughout this thesis.
In addition to fundamental frequency, the vibration pattern of the vocal cords also controls
voice source quality. It is in fact the variation of glottal muscular tensions that distinguishes
between modal, breathy, harsh, creaky, whispery, tense or lax phonations [56]. Studies of
glottal waveforms in diverse phonations have shown that the glottal pulse width, the glottal
pulse skewness, the abruptness of glottal closure and the aspiration noise component are the
main factors which determine different source types [18].
The glottal pulse width is often also referred to as the Open Quotient (OQ) and corresponds
to the ratio of the time in which the vocal folds are open to the whole pitch period
duration, $OQ = \frac{O}{O+C}$ (see Figure 2.2). It indicates the duty cycle of the glottal flow [30].
The glottal pulse skewness, also known as the asymmetry coefficient ($\alpha_m$) or Speed Quotient
(SQ), represents the asymmetry of the glottal pulse and is determined as the time it takes
for the vocal folds to open divided by the time it takes for the vocal folds to close, $SQ = \frac{OP}{CP}$.
Generally, the glottal airflow is skewed to the right, which means that the decrease of
airflow is faster than its increase. These two features, i.e. OQ and SQ, determine the
overall shape of the glottal pulse and affect the lowest frequency components of the source
spectrum, i.e. the part corresponding to the glottal formant [25, 43, 26].
The higher frequencies and spectral tilt of the source spectrum are, on the contrary, determined
by the abruptness of the glottal closure, often also described by the Closing Quotient
(CQ), which is defined as the ratio between the glottal closing phase and the fundamental
period, $CQ = \frac{CP}{O+C}$ [30, 25, 43, 26].
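To make these timing quantities concrete, the following sketch (my own illustration, not taken from the thesis) computes the three quotients from hypothetical glottal event times within one cycle; the function and variable names are invented for the example.

```python
def glottal_quotients(t_open, t_peak, t_close, t_next_open):
    """Open, Speed and Closing Quotients from glottal timing instants (seconds)."""
    OP = t_peak - t_open          # opening phase: flow increasing
    CP = t_close - t_peak         # closing phase: flow decreasing
    O = t_close - t_open          # open phase, O = OP + CP
    C = t_next_open - t_close     # closed phase: glottis closed, zero flow
    T0 = O + C                    # fundamental period
    OQ = O / T0                   # Open Quotient    OQ = O / (O + C)
    SQ = OP / CP                  # Speed Quotient   SQ = OP / CP
    CQ = CP / T0                  # Closing Quotient CQ = CP / (O + C)
    return OQ, SQ, CQ, 1.0 / T0   # 1/T0 is the fundamental frequency F0

# Example: an 8 ms period (F0 = 125 Hz) with a right-skewed glottal pulse
OQ, SQ, CQ, F0 = glottal_quotients(0.0, 0.003, 0.005, 0.008)
print(f"OQ={OQ:.3f}  SQ={SQ:.2f}  CQ={CQ:.3f}  F0={F0:.0f} Hz")
```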
Apart from the quasi-periodic glottal pulses, stochastic energy in the form of aspiration
noise is also produced during voiced phonation. Glottal aspiration noise is pitch syn-
chronous with the glottal waveform. It can occur throughout the entire open phase, and
maximum noise power is reached right after the instant at which the vocal cords begin
to close [19]. A high power noise burst is also likely at the instant of glottal opening.
Thus, generally two pulses of noise occur per cycle: a stronger one at glottal closure and
a weaker one at glottal opening. Aspiration noise is perceptually very important during
breathy and whispery speech production.
Unvoiced excitation: is generated by forming a constriction at some point along the vocal
tract, driving a turbulent noise-like excitation caused by the airflow passing through such
a constriction.
The characteristics of the acoustic filter are defined by the shape of the vocal plus nasal
tracts, which is determined by the position of the articulators, i.e. velum, tongue, teeth and jaw,
changing over time. As for the radiation component, the lips can be thought of as acting as a
low-impedance load responsible for converting the volume velocity waveform into a pressure
wave.
According to this theory, speech is thus produced when the air stored in the lungs passes
through the larynx generating an excitation signal, which can be voiced, unvoiced or a mixture
of both (e.g. for the production of plosive and/or fricative phonemes), to be modulated by the
shape of the vocal tract and radiated through the lips. Such a speech production mechanism is
schematically represented in Figure 2.3.
Figure 2.3 Acoustic Theory of Speech Production
2.1.2 The Source-Filter Model
The Source-Filter Model follows the Acoustic Theory of Speech Production described in the
previous section and involves the modelling and parameterisation of its different components.
Generally, Turbulence and Aspiration Noise are modelled using white Gaussian and amplitude-
modulated white Gaussian noise respectively, and Lip Radiation is approximated by the digital
differentiator of equation (2.1)

$R(z) = Z_{lips} = 1 - a z^{-1} , \qquad 0 < a < 1 .$   (2.1)
The Vocal Tract is usually represented as an all-pole filter, whose characteristics have been shown
to be common to a Lossless Tube Concatenation approximation of the vocal tract [23]. Regarding
the Glottal Wave, numerous parametric models which attempt to describe the shape of the glottal
pulses have been proposed in the literature [36, 40, 26].
The main challenge within source-filter modelling lies in estimating the different component
parameters from the speech signal. While the white Gaussian noise amplitudes can be easily
calculated from the energy of the unvoiced speech segments and the radiation filter coefficient
$a$ is generally set to a fixed value, the estimation of the vocal tract filter coefficients and glottal
waveform parameters from the speech signal is a difficult problem. Various techniques have
been proposed to obtain source and filter parameterisations from the speech wave. Among
those, Linear Prediction (LP) is probably the most widely used.
2.1.2.1 Linear Prediction
Linear Prediction avoids the problem of separating the contributions of the source and the vocal
tract by approximating the glottal wave with a two-pole filter and encoding the combined effects
of the glottal source, vocal tract and lip radiation components in a unique all-pole filter $H(z)$.
Such a filter is then excited, as shown in Figure 2.4, by a sequence of impulses spaced at the
fundamental period $T_0$ during voiced speech, or by white Gaussian noise during unvoiced speech
production, respectively [10].
According to this LP model, the speech signal $s_n$ is approximated by a linearly weighted
summation of its past values

$\hat{s}_n = \sum_{k=1}^{p} a_k\, s_{n-k} ,$   (2.2)

and the prediction error is defined as the difference between the actual $s_n$ and the predicted $\hat{s}_n$
speech values

$e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k\, s_{n-k} .$   (2.3)

$e_n$ is often also referred to as the LP residual. The filter coefficients $a_k$ can then be estimated
by minimising the mean squared prediction error of the segment under analysis

$E = \sum_n e_n^2 = \sum_n \Big( s_n - \sum_{k=1}^{p} a_k\, s_{n-k} \Big)^2 .$   (2.4)
The resulting set of normal equations can be solved using a variety of methods, amongst which
autocorrelation and covariance are the most common.
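As an illustration of the autocorrelation method (my own sketch, not the implementation used in this thesis), the code below estimates the coefficients of $A(z)$ for a windowed frame with a Levinson-Durbin recursion and obtains the residual of equation (2.3) by inverse filtering. Note that the $a_k$ stored here are the coefficients of $A(z) = 1 + a_1 z^{-1} + \dots + a_p z^{-p}$, i.e. the negated predictor coefficients of equation (2.2).

```python
import numpy as np
from scipy.signal import lfilter

def lpc_autocorrelation(frame, order):
    """LP coefficients of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):                 # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                            # reflection coefficient
        a_prev = a[:i + 1].copy()
        for j in range(1, i + 1):
            a[j] = a_prev[j] + k * a_prev[i - j]
        err *= (1.0 - k * k)
    return a, err

def lp_residual(frame, a):
    """Prediction error e_n of equation (2.3), obtained by inverse filtering with A(z)."""
    return lfilter(a, [1.0], frame)

# Toy usage on a synthetic, vowel-like frame (30 ms at 16 kHz, Hamming windowed)
fs = 16000
n = np.arange(480)
frame = np.hamming(480) * (np.sin(2 * np.pi * 120 * n / fs)
                           + 0.3 * np.sin(2 * np.pi * 240 * n / fs))
a, _ = lpc_autocorrelation(frame, order=16)
residual = lp_residual(frame, a)
```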
From the spectral point of view, LP attempts to closely match the speech spectrum and pro-
duce an error signal that is white, i.e. has a flat spectrum. If the speech signal were actually the
response of an all-pole filter, the linear prediction residual would be a train of impulses spaced
at the voiced excitation instants and the impulse/noise source modelling would be accurate. In
practice, however, the residual looks more like a white noise signal with higher energy around
the instants of excitation. These properties are illustrated in Figure 2.5.
The strength of LP lies in its ability to automatically estimate a set of filter coefficients which
compactly represent the spectral envelope of the speech spectrum. For this reason, it has gained
wide acceptance in applications where the spectral characteristics of the speech wave need to
be captured with a small number of parameters, e.g. speech recognition, speaker identification
or voice conversion systems.
Its main drawback stems from its over-simplified model of the glottal source. Because the LP
residual represents the error in the associated LP filter, exciting the filter with its corresponding
residual results in speech that is indistinguishable from the original. However, when an impulse
train is employed as the voiced excitation instead of the LP residual, speech with a very buzzy
quality is produced. This fact has prevented the use of the LP model in high-quality speech
synthesis applications.
Figure 2.4 Linear Prediction Model
Figure 2.5 Properties of the Linear Prediction Residual: a) speech frame, b) LP residual, c) speech spectrum and LP spectral envelope, d) LP residual spectrum
The coupling between LP filters and LP residuals [90, 91] is also a limiting
factor in voice conversion implementations, where the transformation of the LP filter coefficients
makes the original LP residuals no longer appropriate.
2.1.3 Sinusoidal Models
Sinusoidal Models assume the speech waveform to be composed of the sum of a small number
of sinusoids with time-varying amplitudes, frequencies and phases. Such modelling was mainly
developed by McAulay and Quatieri [66, 77] in the mid-1980s, as an alternative to the LP
model capable of producing synthetic speech perceptually indistinguishable from the original.
Variations and extensions to the original formulation have resulted in alternative sinusoidal
models, such as ABS/OLA and HNM, which are widely employed nowadays.
2.1.3.1 The McAulay-Quatieri Model
In this model, the speech signal $s(n)$ is described as

$s_n = \sum_{l=1}^{L} A_l(t) \cos \theta_l(t) ,$   (2.5)

where $A_l(t)$ and $\theta_l(t)$ denote the amplitude and the phase of the $l$-th sinusoidal component
respectively, and where the phase $\theta_l(t)$ is given by the integral of the instantaneous frequency of
the $l$-th sinusoid $\omega_l(t)$

$\theta_l(t) = \int_0^t \omega_l(\sigma)\, d\sigma + \theta_l(0) .$   (2.6)

The model parameters $A_l(n)$ and $\theta_l(n)$ are calculated by peak-picking the Discrete Fourier
Transform of windowed short frames. Then, a nearest neighbour matching algorithm is used
to relate the sinusoid frequencies of one frame to the next, and parameter tracks are created by
interpolating the matched parameters over time using polynomial functions. Once the tracks are
obtained, resynthesis can be achieved by substituting the amplitude and frequency parameters
in equations (2.5) and (2.6).
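The analysis step can be sketched as follows (an illustrative peak-picking routine of my own, not the original McAulay-Quatieri implementation): each windowed frame is transformed with an FFT, local maxima of the magnitude spectrum are picked, and their amplitudes, frequencies and phases become the per-frame sinusoidal parameters of equation (2.5).

```python
import numpy as np

def sinusoidal_analysis(frame, fs, max_sines=60):
    """Peak-pick one frame: returns sinusoidal amplitudes, frequencies [Hz] and phases."""
    win = np.hamming(len(frame))
    spec = np.fft.rfft(frame * win)
    mag = np.abs(spec)
    # indices of local maxima of the magnitude spectrum
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    # keep only the strongest max_sines peaks, in frequency order
    peaks = np.sort(peaks[np.argsort(mag[peaks])[::-1][:max_sines]])
    freqs = peaks * fs / len(frame)               # bin index -> Hz
    amps = 2.0 * mag[peaks] / win.sum()           # compensate for the analysis window
    phases = np.angle(spec[peaks])
    return amps, freqs, phases

# Usage: analyse a 20 ms synthetic frame containing two sinusoids
fs = 16000
t = np.arange(int(0.02 * fs)) / fs
x = 0.8 * np.cos(2 * np.pi * 200 * t) + 0.4 * np.cos(2 * np.pi * 400 * t + 0.5)
A, F, P = sinusoidal_analysis(x, fs)
```

Frame-to-frame matching and interpolation of these parameters, as described above, would then yield the tracks needed for resynthesis.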
2.1.3.2 The ABS/OLA Model
George and Smith [39] extended the basic sinusoidal model proposed by McAulay and Quatieri
to incorporate an analysis-by-synthesis technique for estimating the model parameters more
accurately and a computationally more efficient overlap-add synthesis process which avoids
the need for parameter tracking (hence the name ABS/OLA). In this model, the speech signal is
approximated by

$s_n = \sigma(n) \sum_{k=-\infty}^{\infty} w_s(n - kN_s)\, s_k(n - kN_s) ,$   (2.7)
where $\sigma(n)$ is a modulating envelope sequence introduced to account for syllabic energy varia-
tions and $w_s(n)$ is a complementary synthesis window obeying the constraint

$\sum_{k=-\infty}^{\infty} w_s(n - kN_s) = 1 ,$   (2.8)

for all $n$, where $N_s$ is the length of the synthesis frame and $s_k(n)$ is the $k$-th synthetic contribution
given by

$s_k(n) = \sum_{l=1}^{J(k)} A_l^k \cos(\omega_l^k n + \phi_l^k) ,$   (2.9)

$J(k)$ being the number of sinusoidal components in frame $k$, and $A_l^k$, $\omega_l^k$ and $\phi_l^k$ the sinusoidal
amplitudes, frequencies and phases of the $k$-th frame respectively.
Given the sinusoidal frequencies $\omega_l$, it is possible to estimate the optimal amplitude $A_l$ and
phase $\phi_l$ parameters which minimise the mean-squared error between the analysed signal and
the model, by recursively approximating $s(n)$ adding a single component at a time to minimise
the successive error sequence

$e_l(n) = e_{l-1}(n) - s_l(n) = \big(s(n) - \hat{s}_{l-1}(n)\big) - A_l \cos(\omega_l n + \phi_l) .$   (2.10)

The optimal sinusoidal frequencies $\omega_l$ are exhaustively searched over a set of candidate frequen-
cies uniformly spaced between 0 and $\pi$. This analysis-by-synthesis algorithm provides better
estimates of the sinusoidal components in the signal than the peak-picking method, resulting in
a lower resynthesis mean-squared error.
Synthesis is then performed using overlap-add. Given a symmetric synthesis window, a
synthesis frame of $N_s$ samples is generated by

$s(n + kN_s) = \sigma(n + kN_s)\,\big[\, w_s(n)\, s_k(n) + w_s(n - N_s)\, s_{k+1}(n - N_s) \,\big] ,$   (2.11)

for $0 \le n < N_s$, with typical values of $N_s$ lying between 5 and 20 ms.
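A simplified overlap-add synthesis in the spirit of equation (2.11) is sketched below (an illustration under stated assumptions: a unit modulating envelope, triangular complementary windows satisfying (2.8), and per-frame sinusoidal parameters already available, e.g. from the analysis sketch above).

```python
import numpy as np

def ola_synthesis(frame_params, Ns, fs):
    """Overlap-add synthesis of per-frame sinusoids (cf. equation (2.11), sigma(n) = 1).

    frame_params : list of (amps, freqs_hz, phases) tuples, one per synthesis frame
    Ns           : synthesis frame spacing in samples
    """
    n = np.arange(2 * Ns)
    w = 1.0 - np.abs(n - Ns) / Ns          # triangular window: shifted copies sum to 1
    out = np.zeros((len(frame_params) + 1) * Ns)
    for k, (amps, freqs, phases) in enumerate(frame_params):
        sk = np.zeros(2 * Ns)
        for A, f, p in zip(amps, freqs, phases):          # equation (2.9)
            sk += A * np.cos(2 * np.pi * f * n / fs + p)
        out[k * Ns:(k + 2) * Ns] += w * sk                # weighted overlap-add
    return out
```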
2.1.3.3 The HNM Model
A method to simplify the general sinusoidal modelling is to make the quasi-periodic nature of the
speech wave explicit in the model. Assuming an harmonic relationship, the sinusoidal parame-
ters can actually be estimated from the Discrete Fourier Transform directly, avoiding the need for
peak-picking or analysis-by-synthesis parameter estimation procedures. Many newer sinusoidal
representations [55, 48, 106] have followed this approach. Among these, the Harmonic plus
Noise Model (HNM) [55, 89] is one of the most evolved techniques.
The HNM model assumes the speech signal to be composed of a harmonic part and a noise
part, the harmonic part accounting for the quasi-periodic component of the speech signal and
the noise part modelling the non-periodic components (i.e. aspiration noise). The spectrum is
divided into two bands, determined by the so-called time-varying maximum voiced frequency
$F_m$. In the lower band, the signal is solely represented by harmonically related sinusoids

$h(t) = \sum_{k=1}^{K(t)} A_k(t) \cos\big(k\theta(t) + \phi_k(t)\big) ,$   (2.12)

with $\theta(t) = \int_{-\infty}^{t} \omega_0(l)\, dl$, $A_k(t)$ and $\phi_k(t)$ being the amplitude and phase at time $t$ of the $k$-th
harmonic respectively, $\omega_0(t)$ the fundamental frequency and $K(t)$ the time-varying number of
harmonics. The upper band contains the noise part, whose frequency content is modelled by
an auto-regressive model and its time-domain structure is imposed by a parametric amplitude
envelope.
As a result of the harmonic assumption, the first step of the analysis involves estimating the
fundamental and the maximum voiced frequencies using a standard time-domain pitch detec-
tor and a peak-picking algorithm. The analysis is then done pitch-synchronously on the voiced
portions of speech and at a fixed rate (10 ms) on the unvoiced segments. The sinusoidal ampli-
tude $A_k$ and phase $\phi_k$ parameters of voiced frames are estimated using a weighted least-square
error minimisation criterion, solved by inverting an over-determined system of linear equations.
Regarding the noise part, standard correlation methods are used to calculate the filter and a
triangular-like function is generally used as the time-domain amplitude envelope.
Synthesis is performed pitch-synchronously in an overlap-add fashion. The harmonic part is
synthesised by directly applying equation (2.12), where the amplitudes and unwrapped phases
of the harmonics have been previously linearly interpolated between successive frames. The
noise component measured during analysis is generated by filtering white noise through a time-
varying normalised lattice filter, whose coefficients are derived from the stream of estimated
filter coefficients. The triangular-like time-domain amplitude envelope is then applied to the filtered
noise signal. Finally, the synthetic speech waveform is obtained by adding the harmonic and the
noise parts together.
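A much-simplified sketch of the harmonic-part synthesis of equation (2.12) is shown below (my own illustration: harmonic phases $\phi_k(t)$ are set to zero and the phase integral $\theta(t)$ is accumulated frame by frame from the $F_0$ track to keep the harmonics coherent across frame boundaries).

```python
import numpy as np

def hnm_harmonic_synthesis(f0_track, amp_track, hop, fs):
    """Synthesise the harmonic part h(t) of equation (2.12) from frame-wise parameters.

    f0_track  : fundamental frequency per frame [Hz] (0 for unvoiced frames)
    amp_track : array of shape (num_frames, K) with harmonic amplitudes A_k per frame
    hop       : frame spacing in samples
    """
    num_frames, K = amp_track.shape
    h = np.zeros(num_frames * hop)
    theta = 0.0                                     # running value of the phase integral
    for m in range(num_frames):
        if f0_track[m] <= 0:                        # unvoiced frame: no harmonic part
            continue
        n = np.arange(hop)
        phase = theta + 2 * np.pi * f0_track[m] * n / fs
        for k in range(1, K + 1):                   # k-th harmonic rotates k times faster
            h[m * hop:(m + 1) * hop] += amp_track[m, k - 1] * np.cos(k * phase)
        theta += 2 * np.pi * f0_track[m] * hop / fs  # carry the phase into the next frame
    return h

# Usage: 0.5 s of a 150 Hz voice with 10 harmonics of decaying amplitude (10 ms hop, 16 kHz)
fs, hop = 16000, 160
f0 = np.full(50, 150.0)
amps = np.tile(1.0 / np.arange(1, 11), (50, 1))
h = hnm_harmonic_synthesis(f0, amps, hop, fs)
```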
2.1.3.4 Modification of Speaker Identifying Features
The biggest advantage of the Sinusoidal Models described above is the quality of the resynthe-
sised speech, which is typically indistinguishable from the original. As a result, they have often
been preferred over the LP model in applications involving high-quality speech synthesis.
However, because the sinusoidal amplitudes, frequencies and phases are unrelated to the
process of speech production, sinusoidal modelling is less flexible than the source-filter repre-
sentation when modifying the speaker identifying spectral and prosodic features. For this reason,
sinusoidal models generally adopt a source-lter formulation to carry out pitch-scale, time-scale
and spectral transformations.
Pitch and Time-Scale Modification
Pitch and time-scale modifications within sinusoidal models are accomplished by first decoupling
the source and filter contributions of the speech signal, and then modifying them separately.
Generally, the vocal tract filter magnitudes are calculated using LP and its phases are derived
by homomorphic deconvolution, based on a minimum phase assumption. The excitation am-
plitudes and phases are then dened as the residual amplitudes and phases remaining after the
vocal tract effects are removed from the originally estimated sinusoidal parameters.
For time-scale modification, the rate of change of the vocal tract amplitudes and phases is
directly scaled while the excitation parameters are modified so that the frequency trajectories
are stretched or compressed, but the pitch is kept unchanged. This involves time-scaling the
excitation amplitude and frequency tracks and then recalculating the excitation phases from the
modified frequency values.
For pitch-scale modification, the frequency track is first scaled by the desired factor. Then, the
excitation amplitude components are shifted to the new frequencies and their phase parameters
are recalculated at the modified frequency values. Finally, the vocal tract system parameters are
recomputed by resampling the amplitude and phase functions at the new frequencies.
The phase coherence of the excitation sine waves has been found to be extremely important
for achieving high quality modifications. In fact, if the correct relationship between the modified
phase components is not preserved, the resulting speech is perceived to have a reverberant
els have had to develop methods to cope with the phase incoherency problem resulting from
sinusoidal parameter modications.
McAulay and Quatieri [78] solved this problem with the introduction of the pitch pulse onset
time, defined as the location where a pitch pulse occurs in a frame and where all the sinusoids are
expected to add coherently in phase. This assumption constrains the set of excitation component
phases to obey the following linear function of frequency

$\phi_l^k(t) = (t - t_0^k)\, \omega_l(t) ,$   (2.13)

where $t_0^k$ is the pitch pulse onset time measured with respect to the $k$-th frame boundary, instead
of (2.6). This method requires the estimation of the onset time in each frame, but is capable of
obtaining shape-invariant pitch and time-scale modifications, in which a new sequence of onset
times $t_0'$ must be found for the recalculation of the modified excitation phases.
Based on the same concept, the ABS/OLA model achieves interframe phase coherence by
imparting time shifts derived from the pitch onset times to the modified sinusoidal components.
In addition, George and Smith refine pitch-scale modification by substituting the frequency scal-
ing of the excitation amplitudes with a phasor interpolation technique, which interpolates and
resamples the excitation magnitude spectrum instead. This method has been shown to improve
the quality of the pitch transformations.
The pitch-synchronous analysis and overlap-add synthesis employed by HNM mitigate the
discontinuities at the frame boundaries and avoid the need for the pitch pulse onset times
employed in the previously described pitch-asynchronous approaches. On the other hand, the
automatic detection of pitch marks required by pitch-synchronous methods is still an unsolved
problem which often produces errors resulting in perceptual artifacts. Pitch and time-modified
HNM parameter tracks are obtained using a technique inspired by Pitch-Synchronous Overlap
Add (PSOLA) [70] and, for pitch-scale modification, amplitude and phase spectral envelopes are
calculated and resampled at the new frequencies as in the ABS/OLA model.
Spectral Modification
In theory, the sinusoidal formulation allows a direct and independent modification of the mag-
nitudes and phases of the speech spectrum, by simple alteration of the model parameters. How-
ever, in practice direct transformation of the sinusoidal parameters is troublesome, due to the
high number of parameters involved. For example, analysis of a short segment of speech with
$F_0 = 100$ Hz and a sampling frequency of $F_s = 22{,}050$ Hz would involve 110 complex or 220
real parameters.
Hence, the transformation of the spectral characteristics within sinusoidal modelling is car-
ried out by applying LP to obtain a more compact spectral envelope representation of the si-
nusoidal magnitude speech spectrum, reducing the dimensionality of its encoding to typically
between 10 and 30. In this manner, spectral envelopes can be modified in a more flexible and
efficient way.
2.2 Feature Transformations
Most work to date on voice conversion has focussed on transforming the spectral envelope,
that being the major speaker-identifying segmental characteristic [47, 38, 46, 54]. However,
transformation of the voice source and prosodic features has also been found to be important
and has thus also been explored. The next three sections detail prior approaches to spectral
envelope, voice source and prosody conversion.
2.2.1 Spectral Envelope Conversion
The rst step towards the conversion of the vocal tract characteristics between source and target
speakers is the appropriate choice of the parameters used to represent the spectral envelope.
The most popular acoustic features used in spectral envelope transformation included formant
frequencies [3] and cepstrum coefficients [57], until Arslan et al. [9] introduced line spectral
frequencies (LSF) for the representation of the vocal tract characteristics. LSFs have been shown
to possess very good linear interpolation characteristics and to relate well to formant location
and bandwidth; thus, they are the most commonly employed nowadays to represent spectral
envelopes and yield higher quality conversions [48, 94, 106].
LSFs can be derived from the all-pole vocal tract filter coefficients

$H(z) = \frac{1}{A(z)} = \frac{1}{1 + \sum_{k=1}^{p} a_k z^{-k}} ,$   (2.14)

using the following representation

$A(z) = \frac{1}{2}\,\big(P(z) + Q(z)\big) ,$   (2.15)

where

$P(z) = A(z) - z^{-(p+1)} A(z^{-1})$   (2.16)
$Q(z) = A(z) + z^{-(p+1)} A(z^{-1})$   (2.17)
The LSF parameters are the complex zeros of polynomials $P(z)$ and $Q(z)$. Because when $H(z)$
is stable all zeros are on the unit circle and interlaced with each other, LSF parameters can be
expressed as a sorted list of angles or radian frequencies, which can be easily modified without
compromising the stability and minimum-phase properties of the all-pole filter.
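The construction can be sketched as follows (an illustrative implementation of equations (2.14)-(2.17), not the code used in this work): build $P(z)$ and $Q(z)$ from the prediction polynomial, find their roots, and keep the angles of the non-trivial roots on the upper unit circle as the LSFs.

```python
import numpy as np

def lp_to_lsf(a):
    """Line Spectral Frequencies (radians, sorted) of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p."""
    a = np.asarray(a, dtype=float)                 # a[0] is assumed to be 1
    # coefficients (in powers of z^-1) of A(z) and of z^-(p+1) * A(z^-1)
    az = np.concatenate((a, [0.0]))
    az_rev = np.concatenate(([0.0], a[::-1]))
    P = az - az_rev                                # equation (2.16)
    Q = az + az_rev                                # equation (2.17)
    # when A(z) is minimum phase, all roots of P and Q lie on the unit circle;
    # discard the trivial roots at angles 0 and pi and keep one of each conjugate pair
    angles = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])

# Usage: a 2nd-order example; the two LSFs bracket the resonance angle of A(z)
lsf = lp_to_lsf([1.0, -1.6, 0.9])
```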
Since the frequency resolution of the ear has been shown to be greater at low frequencies
than at high frequencies, the use of a frequency scale taking the non-uniform sensitivity of the
human ear into account should improve the perceptual quality of the conversions. As a result,
spectral envelopes are often warped to a non-linear scale before conversion [48, 104]. The
Bark scale is one of the most commonly used frequency warping methods. In such a scale, the
relationship between the perceptual Bark b and the linear frequency scale f is given by
$b(f) = 6 \log\left( \frac{f}{1200} + \sqrt{\left(\frac{f}{1200}\right)^2 + 1} \right) .$   (2.18)
An example of a linearly and bark scaled spectral envelope is shown in Figure 2.6.
Figure 2.6 Frequency warping example
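Equation (2.18) is straightforward to apply when warping an envelope's frequency grid; the sketch below is a direct transcription (natural logarithm assumed), and the inverse mapping is simply obtained by solving the same formula for $f$.

```python
import numpy as np

def hz_to_bark(f):
    """Perceptual Bark value of linear frequency f [Hz], following equation (2.18)."""
    x = np.asarray(f, dtype=float) / 1200.0
    return 6.0 * np.log(x + np.sqrt(x * x + 1.0))   # equivalently 6 * arcsinh(f / 1200)

def bark_to_hz(b):
    """Inverse of hz_to_bark, used to map a converted envelope back to the linear scale."""
    return 1200.0 * np.sinh(np.asarray(b, dtype=float) / 6.0)

# Warp a linear frequency grid before envelope conversion, then map back afterwards
freqs = np.linspace(0.0, 8000.0, 257)
bark_grid = hz_to_bark(freqs)
recovered = bark_to_hz(bark_grid)                   # equals freqs up to rounding
```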
In addition, because unvoiced sounds generally contain little vocal tract information and
their spectral envelopes present high variations, voice conversion systems generally only trans-
form voiced speech segments and do not modify unvoiced parts.
Initial voice conversion systems achieved spectral envelope conversions by mapping code-
books [3, 57]. In this approach, Vector Quantization (VQ) is used to partition the source and
target feature spaces and create codebooks whose codewords represent a one-to-one correspon-
dence (see Figure 2.7). The fundamental problem with this technique is that only a discrete
set of target features is possible and such a restriction in the variability of the speech envelopes
causes discontinuities and reduces the quality of the converted speech. Weighted-VQ [54] was
proposed to overcome this limitation, i.e. representing the input feature vector as a weighted combination
of the nearest codewords instead of the single nearest codeword to reduce discontinuities, a
technique which some more recent systems [9, 94] still use to achieve conversions (see Figure
2.8).
However, it was the development of continuous probabilistic modelling and transformation
by Stylianou et al. [88] that led to a considerable improvement in conversion performance.
They proposed the use of Gaussian Mixture Models (GMMs) to model and transform source and
target feature distributions.
Describing the source feature space with a GMM allows a softer classification than that of
codebook mapping. This mapping is defined by the conditional probability that a given feature
vector $x$ of dimension $p$ belongs to the acoustic class $C_i$ of the GMM

$P(C_i|x) = \frac{\alpha_i\, N(x; \mu_i, \Sigma_i)}{\sum_{j=1}^{m} \alpha_j\, N(x; \mu_j, \Sigma_j)} ,$   (2.19)

where $\{\alpha_i\}$ are the mixture weights, $m$ is the number of mixture components and $\mu_i$ and $\Sigma_i$ are
the mean and variance of the $i$-th GMM mixture component.
In addition, GMM modelling allows the conversion function to be defined as a mixture of
locally linear transformation functions

$F(x_t) = \sum_{i=1}^{m} P(C_i|x_t)\,\big[ V_i + \Gamma_i \Sigma_i^{-1}(x_t - \mu_i) \big] .$   (2.20)

The conversion is thus entirely determined by the $p$-dimensional vectors $V_i$ and the $p \times p$
matrices $\Gamma_i$, for $i = 1, \dots, m$. These can be estimated by least squares optimization of the
conversion error (when parallel source and target training data are available)

$\epsilon = \sum_{t=1}^{T} \| y_t - F(x_t) \|^2 .$   (2.21)
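The sketch below shows how equations (2.19) and (2.20) are applied at conversion time, assuming the GMM parameters (weights, means and covariances) and the regression parameters $V_i$ and $\Gamma_i$ have already been obtained during training, e.g. by minimising (2.21) over time-aligned source-target pairs. It is an illustration only, not the implementation evaluated later in this thesis; scipy's multivariate normal density stands in for $N(x; \mu, \Sigma)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_convert_frame(x, weights, means, covs, V, Gamma):
    """Apply the conversion function F(x) of equation (2.20) to one source feature vector.

    weights : (m,)       mixture weights alpha_i
    means   : (m, p)     mixture means mu_i
    covs    : (m, p, p)  mixture covariances Sigma_i
    V       : (m, p)     bias vectors V_i of the locally linear transforms
    Gamma   : (m, p, p)  matrices Gamma_i of the locally linear transforms
    """
    # posterior probabilities P(C_i | x) of equation (2.19)
    lik = np.array([w * multivariate_normal.pdf(x, mean=mu, cov=S)
                    for w, mu, S in zip(weights, means, covs)])
    post = lik / lik.sum()
    # mixture of locally linear transformations, equation (2.20)
    y = np.zeros_like(x, dtype=float)
    for i in range(len(weights)):
        y += post[i] * (V[i] + Gamma[i] @ np.linalg.solve(covs[i], x - means[i]))
    return y
```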
Conversion of spectral envelopes using GMMs has been demonstrated to be more robust and efficient than transformations based on VQ. For this reason, this approach has become popular
Figure 2.7 Codebook Mapping [diagram: source codebook entries lsf_1 ... lsf_K map one-to-one to target codebook entries; the source LSF vector is replaced by the converted LSF of its nearest codeword]
Figure 2.8 Weighted Codebook Mapping [diagram: the source LSF vector is described by weights w_1 ... w_K over the source codebook entries, and the converted LSF is the corresponding weighted combination of target codebook entries]
and various refinements have been suggested. Kain and Macon [49] proposed to estimate the
GMM on the joint density of the source and target feature distributions, rather than on the
source density only. They claim that such a modelling makes no assumptions about the target
distributions and leads to a more judicious allocation of mixture components for the regression
to yield the transformation function.
One problem all spectral envelope conversion methods share is the broadening of the spectral peaks, expansion of the formant bandwidths and over-smoothing caused by the averaging effect of the parameter interpolations. This phenomenon makes the converted speech sound slightly muffled. In order to solve this issue, several methods have been proposed. In [8], the bandwidths of the formants in the converted speech are modified to match those of the most likely target codeword by altering the distance between the LSFs representing each formant. In [106], the following perceptual filter is employed

H(z) = \frac{A(z/\beta)}{A(z/\gamma)} , \quad 0 < \gamma < \beta \leq 1 ,    (2.22)

where A(z) is the all-pole filter representing the spectral envelope and \beta and \gamma are set to 1.0 and 0.94 respectively. This filter is applied to the converted spectral envelope as a post-processing stage to narrow formant bandwidths and suppress the noise in spectral valleys. Such a filter has been successfully used for similar purposes in speech coding applications before [15]. A post-filtering example is illustrated in Figure 2.9. Alternatively, [92] have proposed a maximum likelihood transformation approach which takes the global variance of the converted spectra in each utterance into account in order to alleviate the over-smoothing problem.
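A minimal sketch of the perceptual post-filter of equation (2.22) follows, assuming the spectral envelope is available as LP coefficients a (with a[0] = 1) and that the filter is applied to a time-domain signal; in the systems cited above it is applied to the converted envelope within the sinusoidal framework, so this version is for illustration only.

import numpy as np
from scipy.signal import lfilter

def perceptual_postfilter(signal, a, beta=1.0, gamma=0.94):
    """Apply H(z) = A(z/beta) / A(z/gamma) of eq. (2.22).

    Scaling the k-th coefficient of A(z) by beta**k or gamma**k evaluates the
    polynomial on a shrunken z-plane circle, which narrows formant bandwidths
    and attenuates energy in the spectral valleys."""
    a = np.asarray(a, dtype=float)
    k = np.arange(len(a))
    num = a * beta**k     # coefficients of A(z/beta)
    den = a * gamma**k    # coefficients of A(z/gamma)
    return lfilter(num, den, signal)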
Figure 2.9 Post-filtering example [spectral envelope before and after postfiltering; log amplitude (dB) against frequency (rad)]
2.2.2 Voice Source Conversion
The voice source in most state-of-the-art voice conversion systems is modelled by LP residu-
als, derived from the application of LP to reduce the dimensionality of the sinusoidal spectrum
parameterisations for spectral envelope conversion.
As mentioned previously, the LP residuals represent the errors derived from modelling the speech spectrum with an all-pole filter. Although white in theory, in practice they contain effects that the LP assumptions do not capture, e.g. vocal tract and voice source interactions, details of the glottal pulse shape or even unmodelled resonances and spectral zeros. Not surprisingly, residuals have been found to contain a significant amount of speaker information [49] and, for this reason, efforts have been made to develop techniques for their transformation.
Examples of systems that have implemented LP residual conversions based on codebook
mappings are described in Lee et al. [57] and Arslan [8]. Lee et al. parameterised residuals
using non-linear long delay neural nets and employed VQ to build codebooks to map the source
and target residual parameter spaces. Arslan built source and target residual magnitude spec-
trum codebooks from the training data and applied a weighted-VQ technique to estimate the excitation filter responsible for LP residual conversion.
More recent approaches exploit the correlation between spectral envelopes and residuals to
convert excitation characteristics within sinusoidal frameworks. In addition to residual mag-
nitude spectra transformation techniques, most methods also propose ways to modify residual
phases so that phase coherence is preserved.
Residual Prediction was first introduced by Kain and Macon [50]. Based on the assumption
that spectral envelopes and residuals are correlated when only one speaker is considered, the
method consists in predicting the target LP residuals from the transformed spectral envelopes.
During training, a GMM LP parameter classifier and two LP residual codebooks (for magnitude and phase spectra) are built. Each class of the classifier is associated with an entry in the codebooks. A magnitude codebook entry is obtained as the normalized and weighted sum of all residual magnitude spectra, the weights corresponding to the degree of membership of each residual magnitude spectrum to that particular class. Entries in the phase spectra codebook are the centroid of all the residual phase spectra corresponding to each class. To predict the residual of a converted LP spectral envelope, first the GMM posterior probabilities of all classes are calculated. Then, the residual magnitude and phase spectra are obtained from the codebooks. The residual magnitude spectrum is given by a weighted sum of magnitude codebook entries, and the phase spectrum by the most likely phase codebook entry. To alleviate the audible degradation produced by phase discontinuities, phase trajectories are unwrapped and smoothed by zero-phase filtering.
Ye and Young [106] proposed Residual Selection and Phase Prediction as a refinement to Kain and Macon's method. Instead of trying to represent an arbitrary residual magnitude spectrum by a linear combination of a limited number of prototypes, all residual magnitude spectra seen during training are stored in a database together with an associated feature vector composed of line spectral frequencies and their deltas. The appropriate residual magnitude spectrum for a converted spectral envelope is chosen by minimising the squared error between the feature vector of the converted envelope and the feature vectors stored in the database. Then, instead of attempting to convert residual phases, these are obtained from a waveform prototype predicted from the converted spectral envelopes, using a codebook estimated to minimise the coding error between speech frames and their associated waveform prototypes over the training data.
The previous two methods are based on the correlation between spectral envelopes and residuals. Unfortunately, such correlation is sometimes insufficient and results in converted residuals that change too abruptly in consecutive voiced frames, producing perceptual artifacts. To alleviate this problem, Suendermann et al. have proposed Residual Smoothing [90] and Residual Unit Selection [91], techniques which look at ways of reducing frame-to-frame residual variations. Given a sequence of predicted residual target vectors \hat{r}_1, \hat{r}_2, ..., \hat{r}_K, Residual Smoothing applies a normal distribution function to compute a weighted average over the residual vectors in the sequence, which becomes the new converted residual estimate \hat{r}'_k. The standard deviation of the distribution is defined by the product of a voicedness degree \nu_k and an experimentally determined gain \gamma, i.e. \sigma_k = \nu_k \gamma

\hat{r}'_k = \frac{\sum_{\kappa=1}^{K} N(\kappa \,|\, k, \sigma_k) \, \hat{r}_\kappa}{\sum_{\kappa=1}^{K} N(\kappa \,|\, k, \sigma_k)} .    (2.23)
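The following Python sketch illustrates the Gaussian-weighted averaging of equation (2.23); the gain value and the handling of the voicedness degree are placeholders rather than the settings used by Suendermann et al.

import numpy as np
from scipy.stats import norm

def smooth_residuals(residuals, voicedness, gain=2.0):
    """Gaussian-weighted smoothing of a sequence of residual vectors.

    residuals  : (K, D) array of predicted residual vectors r_1 ... r_K
    voicedness : length-K degree-of-voicing values; the standard deviation of
                 the weighting for frame k is voicedness[k] * gain."""
    residuals = np.asarray(residuals, dtype=float)
    K = len(residuals)
    idx = np.arange(1, K + 1)
    smoothed = np.empty_like(residuals)
    for k in range(1, K + 1):
        sigma = max(voicedness[k - 1] * gain, 1e-6)   # avoid a zero-width Gaussian
        w = norm.pdf(idx, loc=k, scale=sigma)
        smoothed[k - 1] = (w[:, None] * residuals).sum(axis=0) / w.sum()
    return smoothed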
Residual Unit Selection extends the general Unit Selection approach widely used in concatenative speech synthesis to the selection of residuals from the training database, in order to directly reduce the variability of the selected residuals before smoothing. The target cost C_t is defined as a weighted sum of the distances between the candidate and the target spectral envelope \hat{v}, fundamental frequency \hat{f}_0 and energy \hat{S}. The squared error difference between two consecutive normalised residuals n(r) is used as the concatenation cost C_c

C_t(r, (\hat{v}, \hat{f}_0, \hat{S})) = \omega_1 \, d(v(r), \hat{v}) + \omega_2 \, d(f_0(r), \hat{f}_0) + \omega_3 \, d(S(r), \hat{S})    (2.24)

C_c(r_{k-1}, r_k) = (1 - \omega_1 - \omega_2 - \omega_3) \, S\{ n(r_k) - n(r_{k-1}) \} , \quad \text{with} \quad n(r) = \frac{r - \bar{r}}{\sqrt{S(r - \bar{r})}} .    (2.25)
The described LP residual conversion techniques have been shown to improve the quality of
spectral envelope conversions by restoring the spectral details and speaker information lost in
the spectral envelope parameterisations. However, because LP residuals do not constitute an ad-
equate model of the glottal source, the methods described above do not provide the voice quality
transformation that source conversions require, e.g. modal, breathy or harsh vocalizations. This
prevents their effective use in systems requiring voice quality modification.
2.2.3 Prosody Conversion
While the vocal tract is known to change with phoneme articulation, pitch, duration and energy
variations are more related to suprasegmental (e.g. tone, stress), paralinguistic (e.g. mood,
emotion, attitude) and sociolinguistic (e.g. region, social group) factors. This makes their mod-
elling very difficult, since high-level linguistic and contextual information needs to be taken into
account. For this reason, methods employed for modelling and converting prosody in voice con-
version frameworks are still rather simplistic. Most systems simply apply basic pitch conversion
techniques while only a few have attempted to also convert duration and energy.
Converting Pitch
Two main approaches to pitch conversion can be found in the literature: conversion of pitch
values on a frame-by-frame basis and conversion of pitch-contours.
The simplest and most widely used pitch conversion method assumes that each speaker's fundamental frequency (F0) values belong to a Gaussian distribution with a specific mean and variance and applies a linear transformation to match them frame-by-frame. Alternative F0 distributions and transformation functions have also been explored [14]. However, this approach is over-simplistic and the resulting pitch-contours often fail to capture the finer intonational structures characteristic of the target.
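A minimal sketch of this frame-by-frame Gaussian mean/variance F0 transformation in Python/NumPy follows; the convention that unvoiced frames carry F0 = 0 is an assumption of the example, and some systems apply the same mapping in the log-F0 domain instead.

import numpy as np

def convert_f0(f0_src, src_stats, tgt_stats):
    """Map source F0 values to the target F0 distribution frame by frame.

    src_stats and tgt_stats are (mean, std) pairs estimated from training data."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    f0_src = np.asarray(f0_src, dtype=float)
    voiced = f0_src > 0                          # unvoiced frames assumed to be 0
    f0_conv = f0_src.copy()
    f0_conv[voiced] = mu_t + (sd_t / sd_s) * (f0_src[voiced] - mu_s)
    return f0_conv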
Results obtained with frame-by-frame pitch transformation techniques led to an investiga-
tion of methods in which, rather than just adjusting the pitch values on a frame-by-frame basis,
whole pitch-contour conversions were attempted. Due to the lack of a pitch-contour representa-
tion capable of modelling speaker-dependent intonation characteristics, the vast majority of the
proposed pitch-contour conversion methods are data-driven. [14] suggested a sentence contour codebook approach. Compared to the frame-by-frame methods, this algorithm showed improvements in more cases and a generally stronger approximation to the contours produced by the target speaker. However, its limitation arises when trying to convert a contour that is not similar to any of the contours in the training data. To make a more efficient use of the training data and improve conversion performance in those cases, [94] proposed using a weighted-VQ method to convert voiced segment pitch-contours rather than entire utterances. Recent advances in the field of emotion conversion have achieved successful pitch-contour transformations using context-sensitive syllable F0 units capable of capturing linguistic and emotional pitch-contour variations [45].
Converting Duration
Duration has not received much attention within the field of voice conversion. Only a couple
of systems based on weighted-VQ [8, 95] have exploited the codebooks built for spectral enve-
lope conversion to compute duration statistics of the speech units in the codebooks and estimate
weighted time-scale modification factors for each frame.
Converting Energy
Energy has also received little attention. In general, voice conversion systems adjust the overall
root mean squared energy of the source utterance to that of the target. Exceptionally, [8,
94] employ a codebook-based energy mapping approach, similar to that described for duration
conversion.
2.3 Evaluation
Transformation performance in voice conversion systems is generally evaluated using both ob-
jective and subjective measures. Objective evaluations are indicative of conversion performance
and can be useful to compare different algorithms within a particular framework. However, ob-
jective measures on their own are not reliable, since they are not directly correlated with human
perception. As a result, a meaningful evaluation of voice conversion systems requires the use of
subjective measures to perceptually evaluate their conversion outputs.
Objective evaluation
Distances between different source, transformed and target speech characteristics are the most
commonly employed objective conversion measurements.
Spectral distortions (SD) have been widely used to quantify spectral envelope conversions.
For example, Abe et al. [3] measured the ratio of spectral distortion between the transformed and target speech and the source and target speech

R = \frac{SD(trans, tgt)}{SD(src, tgt)} .    (2.26)

Stylianou et al. [88] compared the performance of different types of conversion functions using a warped root mean square (RMS) log-spectral distortion measure, and similar spectral distortion measures have been reported by other researchers [50, 105].
In addition, excitation spectrum, RMS-energy, F0 and duration distances have also been used
to measure excitation, energy, fundamental frequency and duration conversions [8].
Subjective evaluation
In order to check if the converted speech is recognized as the source or the target speaker, ABX
tests are most commonly used where participants listen to source (A), target (B) and transformed
(X) utterances and are asked to determine whether A or B is closer to X in terms of speaker
identity. A score of 100% indicates that all listeners find the transformed speech closer to the target. However, because this does not determine how similar the transformed and the target speech are, similarity tests, i.e. asking participants to rate the similarity of stimulus pairs on a rating scale, are sometimes employed instead.
In addition to recognizability, the transformed speech is also generally evaluated in terms
of naturalness and intelligibility by mean opinion score (MOS) tests, in which participants are
asked to rank the transformed speech in terms of its quality and/or intelligibility.
2.4 Limitations of the current approaches
The previous sections have described the state-of-the-art in speech modelling and feature trans-
formation techniques employed in voice conversion frameworks. The existing methods have
been shown to work reasonably well and be capable of achieving convincing identity transfor-
mations when speakers with similar characteristics are converted.
However, if the conventional conversion techniques are extended to more extreme applica-
tions, e.g. accent or emotion conversion and speech repair, results are far from convincing. In
those cases, spectral envelope conversion alone is not enough and the standard voice source,
pitch, duration and energy modification approaches do not achieve the desired results.
The main limitations of current voice conversion technology stem from the lack of an appro-
priate modelling of the voice source and prosodic components.
While the combination of Sinusoidal and LP modelling generally employed in voice conver-
sion applications allows pitch-scale, time-scale and spectral envelope modifications, the derived
LP residuals constitute a poor model of the voice source and prevent the quality transforma-
tions voice source conversions should achieve. The existing LP residual conversion techniques
improve the spectral envelope transformations by incorporating spectral information lost during
LP parameterisation. However, conversion of LP residuals alone cannot transform the quality of
the phonations between, for example, modal, breathy or harsh vocalizations. The use of a better
glottal source representation both during speech modelling and voice conversion is expected to
still achieve reasonable spectral envelope transformations, while increasing the flexibility and
robustness of voice conversion frameworks in applications requiring the modification of voice
source features.
Regarding prosody, the models and methods employed for its conversion are quite simplistic
in general, and particularly as far as duration and energy are concerned.
The aim of this work is to advance the state of the art in these two areas of weakness in
order to make voice conversion systems more robust to extreme applications. In particular, the
modelling of the voice source and duration is refined and novel transformation approaches are
proposed. The validity of the developed conversion techniques is then tested in the context of
speech repair.
In order to make voice source conversion possible, a speech model capable of parameterising
and transforming the glottal source has been developed. Extending a convex optimization-based
joint vocal tract and voice source parameter estimation approach previously employed for high-
quality singing synthesis with vocal texture control, the proposed speech model can resynthesise
speech perceptually almost indistinguishable from the original and obtain high-quality pitch and
time-scale modifications. In addition, the conversion of the spectral envelopes estimated using
the joint estimation technique is comparable to that achieved with conventional sinusoidal plus
LP modelling. The biggest advantage of the developed speech analysis-modification-resynthesis
mechanism is its capability to capture voice source quality variations, a characteristic LP and
sinusoidal models lack.
Exploiting the voice source parameterisation obtained through joint estimation analysis, a
novel method to convert source quality features has also been derived.
Regarding duration, decision trees are proposed to capture the durational characteristics of
different speakers. They have been widely used to model segmental phone durations in text-
to-speech frameworks before and are easily built from text-based features. For conversion, a
decision tree trained with target durational data is used to predict the durations of the converted
phones.
Finally, the developed joint estimation analysis-modification-resynthesis method and refined
voice source and duration modelling and transformation approaches have been applied to repair
the deviant glottal source and duration characteristics of tracheoesophageal speech, testing the
effectiveness of the proposed solutions in an extreme voice conversion application.
3
Joint Estimation Analysis Synthesis
In this chapter, the Joint Estimation Analysis Synthesis (JEAS) model developed for the analysis, modification and synthesis of speech is presented. Its biggest advantage is the automatic and simultaneous parameterisation of the vocal tract and the voice source, which allows the manipulation not only of spectral envelopes, but of glottal characteristics as well. In addition, it also supports high-quality pitch and time-scale modifications. The employed voice source model and source-filter deconvolution technique, and the way analysis, synthesis and prosodic transformations are implemented, are described in detail in the following sections.
3.1 JEAS Model
The developed JEAS model is illustrated in Figure 3.1. It follows the general Source-Filter representation introduced in Section 2.1.2, employing white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. However, instead of simplifying the modelling of the voice source with a two-pole filter as in LP, the Liljencrants-Fant (LF) model [33] is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
The main refinements of JEAS modelling with respect to the LP formulation of the Source-Filter model are thus the use of the LF voice source model and the joint source-filter deconvolution technique.
3.1.1 Modelling the Voice Source: LF model
Numerous parametric models of the glottal source have been proposed in the literature. Despite
their differences, they all share many common features and can be described by a small set
of parameters. In most cases, they exploit the linearity and time-invariance properties of the
Figure 3.1 JEAS Model [block diagram: for voiced frames, the LF model (derivative glottal wave) plus amplitude-modulated white Gaussian noise (high-pass aspiration noise) form the glottal source; for unvoiced frames, white Gaussian noise models turbulence noise; a V/U switch feeds the selected excitation to the all-pole vocal tract filter to produce s(n)]
Source-Filter representation and assume the commutation of the vocal tract and lip radiation
filters to combine the modelling of the source excitation and lip radiation in the parameterisation
of the derivative of the glottal waveform as shown in Figure 3.3. Overviews of some of these
models can be found in [36, 40, 26].
Among the existing glottal wave parameterisations, the LF model [33] seems to have become
the main model employed in research on the glottal source. It has been shown to be capable
of modelling a wide range of naturally occurring phonations [18, 17] and the effects of its
parameter variations are well understood.
The LF model is a four-parameter time-domain model of one cycle of the derivative glottal
waveform. Typical LF pulses corresponding to glottal and derivative glottal waves are shown in
Figure 3.4. It can be described by the following equations:
g(n) =
\begin{cases}
E_0 \, e^{\alpha n} \sin(\omega_g n) & 0 \le n < T_e \\[4pt]
-\dfrac{E_e}{\epsilon T_a} \left[ e^{-\epsilon (n - T_e)} - e^{-\epsilon (T_c - T_e)} \right] & T_e \le n < T_c \le T_0
\end{cases} .    (3.1)
The model consists of two segments: the first one characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation T_e, where the amplitude reaches the maximum negative value E_e. As shown in equation (3.1), the segment is a sinusoidal function which grows exponentially in amplitude, F_g = \omega_g / (2\pi) being the frequency of the sine function and \alpha determining the rate of the amplitude increase. E_0 is a scaling factor used
Figure 3.2 Modelling the Glottal Wave [block diagram: glottal wave plus aspiration noise pass through the vocal tract filter and the lip radiation filter 1 - z^{-1} to produce s(n)]
Figure 3.3 Modelling the Derivative Glottal Wave [block diagram: the lip radiation filter 1 - z^{-1} is commuted before the vocal tract filter, so the excitation becomes the derivative glottal wave plus high-pass aspiration noise]
to ensure that the signal has a zero mean. The timing parameter T_p is related to the sinusoidal frequency through T_p = 1 / (2 F_g) and denotes the instant of the maximum glottal flow.
The second segment models the closing or return phase from the main excitation T_e to the instant of full closure T_c using an exponential function. The duration of the return phase is thus determined by T_c - T_e. The main parameter characterising this segment is T_a, which represents the effective duration of the return phase. This is defined by the duration from T_e to the point where a tangent fitted at the start of the return phase crosses zero. 1/\epsilon is the time-constant of the exponential function, and \epsilon can be determined iteratively from T_a, T_e and T_c through \epsilon = \frac{1}{T_a} \left( 1 - e^{-\epsilon (T_c - T_e)} \right). T_0 corresponds to the fundamental period. Generally, T_c is made to coincide with the opening of the following pulse. This fact might suggest that the model does not account for the closed phase of the glottal waveform. However, for reasonably small values of T_a, the exponential function will fit closely to the zero line, providing a closed phase without the need for additional control parameters.
Figure 3.4 LF model [glottal and derivative glottal LF waveforms over one cycle, with the timing parameters T_p, T_e, T_c, T_a, T_0 and the excitation amplitude E_e marked]
Along with E_e, the LF pulse can be uniquely determined by the following four timing parameters: (T_p, T_e, T_a, T_c). These parameters can be easily identified from the estimated derivative glottal wave. Therefore, they are generally obtained first and the synthesis parameters, from which the LF waveform can be computed directly, (E_0, \alpha, \omega_g, \epsilon), are then derived taking the following constraints into account:
\int_0^{T_0} g(t) \, dt = 0    (3.2)

\omega_g = \frac{\pi}{T_p}    (3.3)

\epsilon T_a = 1 - e^{-\epsilon (T_c - T_e)}    (3.4)

E_0 = \frac{-E_e}{e^{\alpha T_e} \sin(\omega_g T_e)}    (3.5)
More details regarding the implementation of the LF model can be found in [58].
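As an illustration of equations (3.1) and (3.3)-(3.5), the Python/NumPy sketch below generates one derivative glottal LF pulse. It is not the thesis implementation: the growth factor alpha is taken as an input, whereas in practice it would be found numerically so that the zero-mean constraint of equation (3.2) is satisfied, a search not shown here.

import numpy as np

def lf_pulse(fs, T0, Tp, Te, Ta, Tc, Ee, alpha):
    """One derivative glottal LF pulse; timing parameters in seconds, fs in Hz."""
    wg = np.pi / Tp                                    # eq. (3.3)
    # eq. (3.4): fixed-point iteration for the return-phase constant epsilon
    eps = 1.0 / Ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))  # eq. (3.5)
    t = np.arange(0.0, T0, 1.0 / fs)
    g = np.zeros_like(t)
    open_ph = t < Te
    ret_ph = (t >= Te) & (t < Tc)
    g[open_ph] = E0 * np.exp(alpha * t[open_ph]) * np.sin(wg * t[open_ph])
    g[ret_ph] = (-Ee / (eps * Ta)) * (np.exp(-eps * (t[ret_ph] - Te))
                                      - np.exp(-eps * (Tc - Te)))
    return t, g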
Important LF parameters
The main LF parameters used in this thesis are the set of normalised timing parameters R_g, R_k and R_a, which are correlated with the most salient glottal phenomena, i.e. glottal pulse width, skewness and abruptness of closure [32]. They are defined as
R_g = \frac{T_0}{2 T_p}    (3.6)

R_k = \frac{T_e - T_p}{T_p}    (3.7)

R_a = \frac{T_a}{T_0}    (3.8)
R_g is a normalised version of the glottal formant frequency F_g, which is defined as the inverse of twice the duration of the opening phase T_p [29]. R_k is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the times of the opening and closing branches of the glottal pulse, and the larger its value, the more symmetrical the pulse is. The relationship between R_g, R_k and the Open Quotient OQ is: OQ = (1 + R_k) / (2 R_g). Thus, OQ is positively correlated with R_k and negatively correlated with R_g. The R_a parameter corresponds to the effective return time T_a normalised by the fundamental period and captures differences relating to the spectral tilt.
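The normalised parameters of equations (3.6)-(3.8), together with the derived OQ, follow directly from the timing parameters; a small helper (illustrative only, not part of the original work) is shown below.

def lf_shape_parameters(T0, Tp, Te, Ta):
    """Normalised LF timing parameters of eqs. (3.6)-(3.8) and the derived OQ."""
    Rg = T0 / (2.0 * Tp)           # normalised glottal formant frequency
    Rk = (Te - Tp) / Tp            # pulse (a)symmetry
    Ra = Ta / T0                   # normalised effective return time
    OQ = (1.0 + Rk) / (2.0 * Rg)   # open quotient
    return Rg, Rk, Ra, OQ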
E_e is another important LF parameter, closely related to the strength of the source excitation and the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are more determined by the shape of the pulse.
3.1.2 Source-Filter deconvolution
The aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave. Two main deconvolution approaches exist. Before parametric models of the glottal waveform were developed, Inverse Filtering (IF) was the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised. A different approach involves modelling both glottal source and vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters.
3.1.2.1 Inverse Filtering and Glottal Parameterisation
The basic principle of inverse filtering consists in applying a filter to the speech signal with a transfer function which is the inverse of the vocal tract. This way, the effect of the vocal tract is cancelled and the output of the inverse filter produces an estimate of the glottal waveform (see Figure 3.5). For inverse filtering to be effective, obtaining a good estimate of the vocal tract transfer function is very important. In fact, the shape of the glottal pulse is very sensitive to errors in the frequencies and bandwidths of the formants (particularly the first formant), resulting in ripples, bumps and non-flat closed-phases if these are not accurate.
Methods to estimate the vocal tract formant frequencies and bandwidths for IF purposes have evolved from manual analogue to digital and automatic. Initial voice source studies constructed analogue networks which were inverse to the transfer characteristic of the vocal tract [68]. In these systems, the user had to manually adjust the values of the resistors and the capacitors controlling the formant frequencies and bandwidths in order to obtain a maximally flat closed-phase. With the development of digital technology, analogue networks were substituted by computer-implemented digital filters [44], which still required human interaction to decide what filter settings to try and to judge the results.
The introduction of LP into the analysis and synthesis of speech [10] allowed the automatization of IF approaches, since LP coefficients could be used as an estimate of the vocal tract characteristics to automatically determine the all-pole inverse filter. However, the original LP analysis methods (i.e. autocorrelation, covariance) failed to obtain acceptable glottal source estimates, mainly because their formulation also encodes part of the glottal source characteristics. Thus, the problem then became defining a LP formulation capable of reducing the influence of the voice source and providing precise formant frequencies and bandwidths for the purpose of IF and accurate glottal waveform estimation. The following are the two most widely employed methods for automatic IF nowadays: Closed-Phase Covariance LP (CPCLP) and Iterative Adaptive Inverse Filtering (IAIF).
CPCLP [103] is based on the assumption that during the closed-phase there is no interaction between the glottal system and the vocal tract and that, therefore, during that phase the speech waveform becomes a freely decaying oscillation which is strictly a function of the vocal tract resonances. If covariance LP is applied to just the closed-phase portion of the speech waveform, the estimated LP coefficients should only represent the vocal tract filter without any influence of the glottal source.
The main difficulty within this approach lies in the determination of the closed phase from the speech pressure wave. Several algorithms to obtain reliable glottal closure instants (GCI) directly from the speech signal based on LP residual energy [103], Frobenius Norm [63], group delay [84], Kalman filtering [67] and/or dynamic programming [71] have
Figure 3.5 Inverse Filtering [illustration: speech production (derivative glottal wave through the vocal tract filter to the speech wave) and the inverse filtering path in the opposite direction, shown with time-domain waveforms and log-amplitude (dB) spectra]
Figure 3.6 Iterative Adaptive Inverse Filtering [block diagram of the IAIF steps applied to s(n): 1. LPC order 1, 2. IF, 3. LPC order 10-20 giving H_vt1(z), 4. IF, 5. Integration giving gp1(n), 6. LPC order 2-4, 7. IF, 8. LPC order 10-20 giving H_vt2(z), 9. IF, 10. Integration giving gp2(n)]
been proposed in the literature. Additional electroglottographic (EGG) recordings have
also been shown to correlate well with glottal motion and can be used to obtain fairly
accurate estimates of glottal closure [69].
Despite being a popular technique, CPCLP gives reliable results only when the glottal source has a sufficiently long closed-phase. To mitigate this, multi-cycle closed-phase covariance inverse filtering [12], i.e. applying covariance linear prediction analysis to two or more successive closed phases, has been proposed to improve the accuracy of the single closed-phase method when the closed phase is too short, as is often the case for high-pitched female speech.
IAIF [5] is instead based on eliminating the tilting effect of the glottal source from the speech spectrum in an iterative and adaptive way, before applying LP analysis. First, the overall spectral slope of the glottal waveform is modelled using order one LP. In the first iteration, the estimated contribution of the source is removed from the speech spectrum by IF and an initial estimate of the vocal tract filter vt1 is calculated by applying LP analysis to the signal from which the effect of the source was cancelled. Then, by filtering the speech signal with the inverse of vt1 and integrating to cancel the lip radiation effect, the first estimate of the glottal pulse gp1 is obtained. From gp1, a more accurate estimate of the source spectral tilt can be calculated by applying LP analysis of typically order 2 or 4. Finally, the process of removing the more precise voice source contribution from the speech spectrum, calculating a refined vocal tract filter vt2 via LP and inverse filtering is repeated to obtain the final IAIF glottal waveform estimate gp2.
As opposed to the CPCLP technique, IAIF does not rely on the determination and existence of a closed-phase and is capable of providing fairly good glottal estimates even for high-pitched female voices. As a refinement, the use of alternative LP methodologies such as Discrete All-Pole modelling (DAP) [27] in substitution of the conventional LP techniques has been shown to improve the accuracy of the estimation of the vocal tract transfer function, and as a result, of the estimated glottal waveforms [7].
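The sketch below is a much-simplified, illustrative rendering of the IAIF steps just described, using plain autocorrelation LP throughout; the frame handling, the leaky integrator used to cancel lip radiation and the default filter orders are assumptions of the example rather than details taken from [5].

import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method LP polynomial A(z) = 1 + a_1 z^-1 + ... + a_p z^-p."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], -r[1:order + 1])
    return np.concatenate(([1.0], a))

def iaif(s, fs, p_vt=None, p_glottis=4, leak=0.99):
    """Simplified Iterative Adaptive Inverse Filtering of one speech frame s."""
    p_vt = p_vt if p_vt is not None else 2 + fs // 2000
    # steps 1-2: order-1 LP models the source spectral slope, which is removed
    s1 = lfilter(lpc(s, 1), [1.0], s)
    # steps 3-5: first vocal tract estimate, inverse filtering and integration
    g1 = lfilter(lpc(s1, p_vt), [1.0], s)
    g1 = lfilter([1.0], [1.0, -leak], g1)          # cancel lip radiation
    # steps 6-7: refined source model (low order) removed from the speech
    s2 = lfilter(lpc(g1, p_glottis), [1.0], s)
    # steps 8-10: refined vocal tract filter, inverse filtering and integration
    a_vt2 = lpc(s2, p_vt)
    g2 = lfilter(a_vt2, [1.0], s)
    g2 = lfilter([1.0], [1.0, -leak], g2)
    return g2, a_vt2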
Both CPCLP and IAIF work reasonably well in steady-state vowels and segments with low F0
and modal phonations. However, their performance deteriorates particularly in segments with
higher F0, non-modal phonations and in transitions found in continuous speech where the vocal
tract undergoes a rapid change. In those cases, manual optimization of the estimated vocal tract
characteristics often greatly improves IF results. For this reason, most recent studies of the voice
source [41, 4] use analysis methodologies where automatically estimated formant frequencies
and bandwidths are interactively adjusted to achieve best results.
After an estimate of the glottal source has been obtained, it is generally parameterised. The
simplest way of capturing the most important glottal features is to apply time-domain mea-
surements to determine the closed, opening and closing phases of the glottal and/or glottal
derivative waveforms and quantify the Open Quotient (OQ), Speed Quotient (SQ) and Closing
Quotient (CQ) parameters. However, time instants of glottal opening and closure are sometimes difficult to extract exactly due to formant ripple and noise and thus, accurate computation of their values is in practice problematic. To cope with this problem, alternative parameters based on more robust amplitude-domain measurements, which have been shown to capture the effective declination time and be correlated to CQ, have been proposed. Among these, AQ, NAQ [6] and R_d [30, 31] have become increasingly popular. These parameterisations are capable of measuring glottal waveform variations and have been successfully applied in voice source analysis studies. However, they do not constitute a model of the glottal source and cannot be used for speech modification purposes.
A way of parameterising the glottal waveform obtained by IF using a glottal source model is to apply a fitting procedure to determine the optimal underlying mathematical model parameter values that match its shape. For the particular case of LF model fitting, a method was proposed in [85, 86] which has been widely applied and will be discussed in more detail in Section 3.2.3.
3.1.2.2 Joint Source-Filter Estimation
An alternative to inverse filtering and glottal parameterisation is the simultaneous estimation of glottal source and vocal tract model parameters from the speech wave. Due to the characteristics of the mathematical source and tract descriptions, such an approach is a complex nonlinear problem. For this reason, the use of LP has been deployed more widely as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in the context of speech coding and to efficiently represent the speech spectrum with a small number of parameters. However, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasized the importance of refined vocoding and thus, the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
Several joint estimation algorithms have been proposed in the literature. A summary of their most relevant characteristics is shown in Table 3.1. Most of them model the voice source with one of the existing glottal waveform models and the vocal tract as an all-pole or pole-zero filter, define an error criterion to minimise and describe techniques to solve the resulting, generally non-linear, problem.
For example, Fujisaki et al. [37] use a pole-zero filter to model the vocal tract and a model proposed by themselves [36] for the glottal source. They define an error function which is linear in the vocal tract filter parameters and which allows a simple estimation of the spectral envelope provided that the glottal source parameters are known in advance. The voice source parameterisation is then obtained using an iterative search procedure to minimise the global prediction error.
Krishnamurthy [53] describes the glottal waveform using the LF model reformulated with
complex exponentials and the vocal tract as a pole-zero system with a different set of values in
the closed and open phases to capture the source-tract interaction. Each of these models is then
estimated using a two-step least-square error (LSE) criterion approach.
Various other authors represent the speech production process as a time-varying auto-regressive model with an exogenous input (ARX). In [24], the filter excitation is approximated by the Rosenberg-Klatt (RK) [52] model, Kalman filtering is used to estimate the vocal tract formants-antiformants and Simulated Annealing (SA) is employed to search for the best set of voice source parameters to minimise the mean-squared estimation error of the ARX formulation.
Fu and Murphy [35] also use the ARX model and Kalman filtering in their joint estimation approach, but represent the source with the LF model and define a LSE minimisation criterion instead. In order to solve the resulting non-linear optimisation problem, they construct a simpler approximate problem based on pitch-synchronous covariance LP and RK model fitting, whose global solution is taken as the initial starting point of a descent algorithm used to robustly solve the initial and more complex problem.
Still in the context of ARX and LF modelling, Vincent et al. [101] define a LSE criterion that can be easily solved for a given set of source parameters, which are optimised by an exhaustive evaluation of the error over a constructed finite subspace of all the possible source parameter vectors. In addition, they split the analysis into low and high frequency stages, in order to estimate the glottal source parameters related to the glottal formant and the spectral tilt separately and more accurately.
Lu and Smith [61] have proposed Convex Optimization to jointly estimate the voice source and vocal tract parameters. Their approach models the derivative glottal waveform with the Rosenberg-Klatt (RK) model [52], the vocal tract as an all-pole filter, and assumes that glottal closure instants can be obtained accurately. In addition, they define the error criterion as the difference between the estimated and the true glottal waveforms. The main advantage of this method is that the resulting problem does not require iterative search or robust initialization procedures and can be relatively easily solved using Quadratic Programming. Once the optimal set of vocal tract coefficients has been estimated, they use inverse filtering and LF model fitting to refine their modelling of the glottal source.
Using a different approach, Frohlich et al. [34] modify the formulation of the Discrete All-Pole (DAP) modelling method to cancel the effect of the differentiated glottal flow in the vocal tract filter parameter calculation. They do so by modelling the voice source with the LF model and applying multidimensional optimization techniques to find the set of LF parameters which minimise the Itakura-Saito error for discrete signals employed in conventional DAP.
It is hard to evaluate and compare the described joint estimation approaches. The perfor-
mance of each method is evaluated in a different way, generally comparing the obtained glottal
source parameterisations with the values used to generate synthetic test speech or measuring
the correlation between the estimated parameters and EGG-based features (GCI, OQ) of natural
speech segments. The different techniques have not been compared in terms of computational
complexity either. In addition, most methods have just been applied in analysis frameworks.
Only Lu and Smith's Convex Optimization approach [62] and Vincent et al.'s ARX-LF model
[102] have been employed for speech synthesis and modification purposes. The Convex Optimization technique has been shown to be capable of synthesising high-quality singing speech with different qualities, generated by adequate modification of the estimated glottal source parameters. The ARX-LF model, on the other hand, has only been tested for time and fundamental frequency modifications so far.
Method | Source Model | Vocal Tract Model | Error | Estimation Methods
[37] | FL86 | pole-zero | s - ŝ | Iterative Search of FL86; LSE
[53] | LF | pole-zero | s_O - ŝ_O, s_C - ŝ_C | OC Detection; LSE
[24] | RK | pole-zero, ARX | s - ŝ | Simulated Annealing of RK; Kalman Filtering; LSE
[35] | LF | pole-zero, ARX | s - ŝ | Initialisation: CPCLP + RK fitting; Descent Algorithm of LF; Kalman Filtering; LSE
[101] | LF | all-pole, ARX | s_LF - ŝ_LF, s_HF - ŝ_HF | Subspace Search of LF; LSE
[61] | RK, LF | all-pole | g - ĝ | Convex Optimization; IF + LF fitting
[34] | LF | all-pole | Itakura-Saito Error | Modified DAP; Multi-dimensional optimization of LF

Table 3.1 Joint Estimation Methods [where s and ŝ are the true and predicted speech signals, g and ĝ are the true and predicted glottal waveforms, O and C are the open and closed phases and LF and HF the low and high frequencies respectively] (Refer to the List of Acronyms)
Whilst Inverse Filtering techniques are still widely employed nowadays, particularly in the field of voice source analysis, they often require manual adjustment. Joint Estimation methods are, however, fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. As a result, a joint source-filter deconvolution approach has been followed in the speech model developed in this thesis for voice conversion and speech repair applications. Because it does not require computationally intensive iterative search procedures, has been tested in a synthesis framework and allows an artifact-free modification of its voice source parameters, the joint estimation technique proposed by Lu and Smith has been adopted and extended in the developed JEAS
system. The next two sections describe the details of its analysis and synthesis implementations.
3.2 JEAS Analysis
During analysis, voiced and unvoiced speech segments are processed differently due to their
diverse source characteristics. While the voice source in voiced speech is represented by a com-
bination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal
tract filter in unvoiced frames (see Figure 3.1). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V) and through standard autocorrelation LP and Gaussian noise energy (GNE) matching in unvoiced portions (U).
3.2.1 Voicing Decision and GCI Detection
Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) [71] is used for GCI estima-
tion. It employs the group-delay function in combination with a phase-slope projection method
to determine GCI candidates, plus N-best dynamic programming to select the most likely candi-
dates according to a cost function which takes waveform similarity, pitch deviation, normalised
energy and deviation from the ideal phase-slope into account. Its accuracy has been shown to
outperform that of LP residual [103], Frobenius Norm [63] and Group Delay [84] methods and
to be good enough for the purposes of the developed speech model.
The voicing decision is made based on energy, zero-crossing and GCI information. Voiced
segments are then processed pitch-synchronously, while unvoiced frames are extracted every
10ms.
3.2.2 Adaptive Joint Estimation
The method employed for joint voice source and vocal tract parameter estimation is based on the Convex-Optimization approach first proposed in [61] and refined in [62, 60]. It involves using a voice source model simple enough to allow the source-filter deconvolution to be formulated as a convex optimization problem. Then, the derivative glottal waveform obtained by IF with the estimated filter coefficients is re-parameterised by LF model fitting. The success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is straightforward.
The joint estimation algorithm models the voice source using the Rosenberg-Klatt model [52], which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter 1 / (1 - \mu z^{-1}) with \mu > 0, as shown in Figure 3.7. The RK derivative of the
glottal waveform is given by

g(n) =
\begin{cases}
0 & 1 \le n < n_c \\
2a(n - n_c) - 3b(n - n_c)^2 & n_c \le n < T_0
\end{cases} ,    (3.9)

where T_0 corresponds to the pitch period and n_c represents the duration of the closed phase, which can also be expressed as

n_c = T_0 - OQ \cdot T_0 ,    (3.10)

OQ being the open-quotient, i.e. the fraction of the pitch period in which the glottis is open. In addition, the parameters a and b need to be always positive and hold the following relationship,

a = b \cdot OQ \cdot T_0 ,    (3.11)

in order to maintain an appropriate waveshape.
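A small Python/NumPy sketch generating one period of the RK derivative glottal wave from equations (3.9)-(3.11) is given below; the parameter names and the sample-domain timing are illustrative assumptions.

import numpy as np

def rk_pulse(T0, OQ, b):
    """One period of the RK derivative glottal wave.

    T0 is the pitch period in samples, OQ the open quotient and b > 0 the
    free amplitude parameter; a is tied to b through eq. (3.11)."""
    nc = int(round(T0 - OQ * T0))        # closed-phase length, eq. (3.10)
    a = b * OQ * T0                      # eq. (3.11)
    n = np.arange(T0)
    g = np.zeros(T0)
    m = n[nc:] - nc
    g[nc:] = 2.0 * a * m - 3.0 * b * m**2
    return g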
Figure 3.7 Joint Estimation Analysis Model [block diagram: the RK basic voicing waveform passes through the low-pass filter 1/(1 - \mu z^{-1}) to form the glottal source, which excites the p-th order all-pole vocal tract filter to produce the voice output]
Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms. The modelled derivative glottal waveform ĝ(n) corresponds to that of equation (3.9), while the true derivative glottal wave g(n) is obtained through inverse filtering as

g(n) = s(n) - \sum_{k=1}^{p} \alpha_k \, s(n-k) ,    (3.12)

where s(n) is the speech wave and \alpha_k are the coefficients of the vocal tract all-pole filter. The error between the modelled and the true derivative glottal waves e(n) can be calculated by subtracting equations (3.9) and (3.12)
e(n) = ĝ(n) - g(n) =
\begin{cases}
0 - s(n) + \sum_{k=1}^{p} \alpha_k \, s(n-k) & 1 \le n < n_c \\[4pt]
2a(n - n_c) - 3b(n - n_c)^2 - s(n) + \sum_{k=1}^{p} \alpha_k \, s(n-k) & n_c \le n < T_0
\end{cases}    (3.13)
Rearranging the previous expression and rewriting it in matrix form we have

E =
\begin{bmatrix} e(1) \\ \vdots \\ e(n_c) \\ e(n_c + 1) \\ \vdots \\ e(T_0) \end{bmatrix}
=
\begin{bmatrix}
s(0) & \cdots & s(1-p) & 0 & 0 \\
\vdots & & \vdots & \vdots & \vdots \\
s(n_c - 1) & \cdots & s(n_c - p) & 0 & 0 \\
s(n_c) & \cdots & s(n_c + 1 - p) & 2(1) & -3(1)^2 \\
\vdots & & \vdots & \vdots & \vdots \\
s(T_0 - 1) & \cdots & s(T_0 - p) & 2(T_0 - n_c) & -3(T_0 - n_c)^2
\end{bmatrix}
X
-
\begin{bmatrix} s(1) \\ \vdots \\ s(n_c) \\ s(n_c + 1) \\ \vdots \\ s(T_0) \end{bmatrix}
= FX - S ,

where X = [\alpha_1 \;\; \cdots \;\; \alpha_p \;\; a \;\; b]^T is the parameter vector to estimate so that the sum of the squares of the equation error E is minimised, i.e.

\min_X \|E\|^2 = \min_X \sum_{n=1}^{T_0} \left( E(n) \right)^2 = \min_X \|FX - S\|^2 .    (3.14)
Lu demonstrated in [61] that the simplicity of the RK glottal model guarantees this optimization to be convex, i.e. to only have one minimum which corresponds to the optimal solution, and thus, to be efficiently solvable via Quadratic Programming. A quadratic problem is defined as follows

\min_X q(X) = \frac{1}{2} X^T H X + g^T X    (3.15)

\text{subject to:} \quad A X \le b , \quad A_{eq} X = b_{eq}    (3.16)
Equation (3.14) can be solved using quadratic programming if expanded to have its form, i.e.

\min_X \|FX - S\|^2 = (FX - S)^T (FX - S) = X^T F^T F X - 2 S^T F X + S^T S ,    (3.17)

by defining

H = 2 F^T F , \qquad g^T = -2 S^T F    (3.18)

and ignoring the term S^T S, which is always positive, for the purposes of minimisation. In addition, equation 3.11 imposes the following equality and inequality constraints

a > 0 , \quad b > 0 , \quad a = b \cdot OQ \cdot T_0 .    (3.19)
Figure 3.8 Joint Estimation example: a) speech period, b) speech spectrum and jointly estimated spectral envelope, c) inverse filtered residual and jointly estimated RK wave
The derived quadratic program can be solved using a number of existing iterative numerical algorithms. In the developed implementation, the quadratic programming function of the MATLAB Optimization Toolbox [64] has been employed. The result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients \alpha_k. Figure 3.8 shows a joint estimation example for one pitch period.
The described joint estimation process assumes that the closed and open-phases are defined, while in practice the parameter which delimits the end of the closed-phase and the beginning of the open-phase, n_c, is unknown. Its optimal value is found by uniformly sampling the possible n_c values (empirically shown to vary from 0% to 60% of the pitch period T_0 [52]), solving the quadratic problem at each sampled n_c value and choosing the estimate resulting in minimum error.
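The sketch below illustrates this per-period search in Python/SciPy. It is not the thesis implementation (which uses the MATLAB quadratic programming routine): here the equality constraint of equation (3.19) is substituted directly into the problem so that a bounded least-squares solver can be used, and the grid of candidate n_c values, the function names and the data layout are assumptions of the example.

import numpy as np
from scipy.optimize import lsq_linear

def joint_estimate_period(s_hist, s_per, p, nc_fracs=np.linspace(0.0, 0.6, 13)):
    """Joint RK source and all-pole filter estimation for one pitch period.

    s_per  : speech samples s(1)...s(T0) of one pitch period
    s_hist : the p samples s(1-p)...s(0) preceding the period
    p      : all-pole filter order"""
    s_all = np.concatenate([s_hist, s_per])
    T0 = len(s_per)
    best = None
    for frac in nc_fracs:
        nc = int(round(frac * T0))
        # filter columns hold s(n-1) ... s(n-p) for n = 1..T0
        F_alpha = np.column_stack([s_all[p - k: p - k + T0] for k in range(1, p + 1)])
        m = np.arange(1, T0 + 1) - nc
        m[:nc] = 0                                    # closed-phase rows contribute nothing
        col_a = 2.0 * m
        col_b = -3.0 * m.astype(float) ** 2
        col_comb = (T0 - nc) * col_a + col_b          # substitute a = b*(T0 - nc)
        F = np.column_stack([F_alpha, col_comb])
        res = lsq_linear(F, s_per,
                         bounds=(np.r_[np.full(p, -np.inf), 0.0],
                                 np.full(p + 1, np.inf)))
        if best is None or res.cost < best[0]:
            b = res.x[-1]
            best = (res.cost, nc, res.x[:p], b * (T0 - nc), b)
    _, nc, alphas, a, b = best
    return alphas, a, b, nc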
3.2.2.1 Encoding the Spectral Tilt
As can be seen in Figure 3.9, the basic RK voicing waveform of equation (3.9) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter was added to the basic model, with the purpose of reducing the abruptness of glottal closure. In the frequency domain, the filter coefficient \mu is responsible for controlling the tilt of the source spectrum.
In order to allow the formulation of the convex optimization problem, in [61] the spectral tilt filter is separated from the source model and incorporated into the vocal tract model by adding an extra pole to the all-pole filter, as shown in Figure 3.10. This implies that the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source. As a result, the derivative glottal waveforms obtained using this approach fail to
Figure 3.9 RK derivative glottal wave [time-domain plot of the basic RK voicing waveform over several periods; amplitude against time (s)]
Figure 3.10 Spectral Tilt modelling in Lu and Smith's formulation [block diagram: the RK basic voicing waveform (glottal source) excites a (p+1)-th order all-pole filter representing the vocal tract plus spectral tilt, producing the voice output]
adequately capture the variations in the return phase of the glottal source.
In this work, adaptive pre-emphasis is used to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization, hence the name Adaptive Joint Estimation (AJE). Order one LP analysis and IF is applied to estimate and remove the spectral slope from the speech frames under analysis. The effect of adaptive pre-emphasis is illustrated in Figure 3.11. The vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are reflected in the closing phase of the resulting derivative glottal waveforms instead. This improves the fitting of the return phase of the LF model and thus, of the high frequencies of the glottal source.
Figure 3.11 Effect of adaptive pre-emphasis (panels shown without and with adaptive pre-emphasis): a) Speech spectrum (S) and estimated spectral envelope (SE), b) IF derivative glottal wave (IF dgw) and fitted LF waveform (fitted dgw), c) IF derivative glottal wave spectrum (IF dgw) and fitted LF wave spectrum (fitted dgw)
3.2.3 LF Fitting
The LF model is capable of more accurately describing the glottal derivative waveform than the RK model. However, its more complex nonlinear formulation fails to fulfil the convexity condition and prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm. Instead, the RK model is employed during source-filter deconvolution and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
LF model fitting is carried out in two steps. First, initial estimates of the LF timing parameters (T_p, T_e, T_a, T_c) and the glottal excitation E_e are obtained from the time-domain IF voice source
waveform by direct estimation methods [85, 87, 86, 72]. Then, their values are refined using constrained nonlinear optimization [64]. The overall procedure is as follows.
The glottal excitation strength E_e and its time index T_e are located first by finding the minimum of the IF derivative glottal waveform. Then, T_p and T_c are determined as the first zero-crossings before and after T_e respectively. T_a is estimated as T_a = (T_c - T_e) \cdot 2/3. T_p and T_a are further refined using constrained nonlinear minimisation. Because the initial E_e, T_e and T_c estimates are quite reliable, their values are kept unchanged during optimization. T_a is confined to vary between 0 and T_c - T_e, and T_p to be within 20% of its initial estimate. The return and open phases are optimized separately and sequentially. In both cases, the minimisation function is the sum of the squared error between the IF derivative glottal wave and the fitted estimate for the particular phase. Figure 3.12 shows an example of LF fitting in normal, breathy and pressed phonations.
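The direct-estimation step just described can be sketched as follows (Python/NumPy, illustrative only); the subsequent constrained nonlinear refinement of T_p and T_a is not shown, and the fallback values used when no zero crossing is found are assumptions of the example.

import numpy as np

def lf_initial_estimates(dgw, fs):
    """Initial LF timing estimates from an IF derivative glottal wave (one period)."""
    dgw = np.asarray(dgw, dtype=float)
    ie = int(np.argmin(dgw))                       # index of the main excitation
    Ee = -dgw[ie]
    sb = np.signbit(dgw).astype(np.int8)           # sign pattern of the waveform
    # last zero crossing before the excitation gives Tp, first one after gives Tc
    before = np.where(np.diff(sb[:ie + 1]) != 0)[0]
    after = np.where(np.diff(sb[ie:]) != 0)[0]
    ip = int(before[-1]) if len(before) else 0
    ic = ie + int(after[0]) + 1 if len(after) else len(dgw) - 1
    Te, Tp, Tc = ie / fs, ip / fs, ic / fs
    Ta = 2.0 / 3.0 * (Tc - Te)
    return dict(Ee=Ee, Te=Te, Tp=Tp, Tc=Tc, Ta=Ta)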
Figure 3.12 LF fitting examples: a) normal, b) breathy and c) pressed IF derivative glottal waves (dgw) and fitted LF waveforms (fitted dgw)
3.2.4 Modelling Aspiration Noise
Because the LF parameterisation does not model glottal aspiration noise, the stochastic compo-
nent present in the IF derivative glottal waveform is not captured during LF fitting. However,
perceptually, the lack of aspiration noise results in an unnatural speech quality and thus, a
methodology for its extraction and parameterisation has been developed within the JEAS frame-
work.
Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal
wave estimate. Compared to other techniques employed to identify and separate the periodic
and aperiodic components of quasi-periodic signals such as frequency transform analysis [81]
or periodic prediction [10], Wavelet Packet Analysis has been found to obtain more reliable
aspiration noise estimates [60]. Wavelet Packet Analysis is performed at level 4 with the 7th order Daubechies wavelet, using soft-thresholding and the Stein Unbiased Risk Estimate threshold selection criterion. Figure 3.13 shows a typical denoising result.
Figure 3.13 Denoising example: a) original and denoised IF derivative glottal wave, b) noise estimate
Once an estimate of the aspiration noise has been extracted, it needs to be parameterised. Studies of aspiration noise have shown that this is synchronous with the glottal wave and likely to present noise bursts at glottal closure and often also at glottal opening [19]. Most models neglect the nature of the glottal opening pulse and approximate it as pitch synchronous amplitude modulated Gaussian noise, with higher energy around the glottal closure instants [62, 17]. The amplitude of the noise burst is usually modulated using Rectangular, Hanning or Hamming windows. A spectral shaping filter is sometimes included to account for the average spectral density of the aspiration noise and the high-pass filtering introduced by the commutation of the vocal tract and radiation filters. However, various models also neglect the spectral shaping filter since it has been found not to be perceptually important [60]. These pitch synchronous amplitude modulated Gaussian noise approaches require the determination of the following parameters, illustrated in Figure 3.14, from the aspiration noise component:
Noise Floor (N_f): the noise floor of the aspiration noise
Noise Pulse Amplitude (NP_a): the amplitude modulation index of the noise pulse
Noise Pulse Position (NP_p): the position of the center of the noise pulse window in the glottal period
Noise Pulse Width (NP_w): the width of the noise pulse window
Unfortunately, automatic calculation of the above parameters from the estimated aspiration noise components is troublesome in many cases. In order to avoid these errors, a different approach is followed in the JEAS implementation. While the aspiration component is still approximated as pitch synchronous amplitude modulated Gaussian noise, an alternative function which does not require the estimation of N_f, NP_a, NP_p and NP_w is employed to modulate its amplitude: the derivative glottal LF waveform. In fact, the shape of the derivative glottal LF waveform follows the most salient amplitude modulation characteristics of glottal aspiration noise, i.e. the magnitude of its amplitude increases during the open phase and is maximum at glottal closure. If stationary Gaussian noise is modulated with a derivative glottal LF waveform, the resulting signal will present the two likely aspiration noise bursts around glottal opening and glottal closure as shown in Figure 3.15. According to informal listening tests, this approach is comparable to the previously described window-based modelling techniques.
Thus, the aspiration noise estimate obtained for a particular pitch period during JEAS analysis is parameterised as follows. First, zero mean unit variance Gaussian noise is modulated with the already fitted derivative glottal LF waveform for that pitch period. Then, its energy is adjusted to match that of the aspiration noise estimate ANE. Because using a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation. Figure 3.16 depicts a diagram of the employed aspiration noise modelling approach.
Figure 3.14 Standard Aspiration Noise Model Parameters
Figure 3.15 Gaussian Noise modulation by a derivative glottal LF waveform: a) Gaussian Noise source, b) derivative glottal LF waveform, c) LF Modulated Gaussian Noise
Figure 3.16 Aspiration Noise Model (block diagram: a Gaussian noise source is amplitude-modulated by the fitted LF waveform and its energy is matched to that of the estimated aspiration noise component to produce the modelled aspiration noise)
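A minimal sketch of this parameterisation, assuming the fitted derivative glottal LF waveform and the aspiration noise energy ANE of the current pitch period are already available (names are illustrative), could look as follows.

import numpy as np

def model_aspiration_noise(lf_dgw, ane, rng=None):
    """Aspiration noise model of Figure 3.16: Gaussian noise, LF amplitude
    modulation and energy matching to the estimated ANE."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(lf_dgw))        # zero mean, unit variance Gaussian noise
    modulated = lf_dgw * noise                      # LF amplitude modulation
    energy = np.sum(modulated ** 2)
    if energy > 0.0:
        modulated *= np.sqrt(ane / energy)          # energy matching
    return modulated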
JEAS ANALYSIS ALGORITHM
Input: s(n)
Output: {α_1, α_2, ..., α_p}, {E_e, T_p, T_e, T_a, T_c, ANE} in Voiced (V) frames
        {α_1, α_2, ..., α_p}, GNE in Unvoiced (U) frames
// GCI detection
1  GCI ← DYPSA algorithm
2  foreach frame do
     // voicing decision
3    V/U ← Energy, Zero-Crossings
4    if frame is V
       // find optimal n_c
5      n_c ← Linear Search [0, 0.6 T_0]
       // remove spectral tilt using Adaptive Pre-Emphasis a_p
6      s_p(n) = s(n) - a_p s(n-1)
       // estimate filter parameters using Convex Optimization
7      {α_1, α_2, ..., α_p} ← min_X ||F X - S_p||^2
       // inverse filter s(n) with the estimated {α_1, α_2, ..., α_p}
8      dgw(n) = s(n) - Σ_{k=1}^{p} α_k s(n-k)
       // fit a derivative glottal LF waveform to dgw(n)
9      {E_e, T_p, T_e, T_a, T_c} ← Direct Estimation, Constrained Nonlinear Minimisation
       // estimate the Aspiration Noise Energy ANE of dgw(n)
10     ANE ← Wavelet Packet Analysis
11   elseif frame is U
       // estimate filter parameters using LP Analysis
12     {α_1, α_2, ..., α_p} ← min Σ_n (s(n) - Σ_{k=1}^{p} α_k s(n-k))^2
       // estimate the turbulence Gaussian Noise Energy GNE
13     GNE ← LP residual
     end
   end
Summary of the JEAS Analysis Algorithm
3.3 JEAS Synthesis
Synthesis is done by following the JEAS Model of Figure 3.1 and applying the parameters estimated during analysis. In theory, each frame k of the speech waveform, which corresponds to a pitch period in voiced segments and to a fixed 10 ms segment in unvoiced parts, can be generated by filtering the estimated voiced or unvoiced excitation signal e(n) with the vocal tract filter vt for that particular frame

s_k(n) = e_k(n) * vt_k(n) = e_k(n) + Σ_{i=1}^{p} α_i s_k(n-i),   n = 1 ... N_k,   (3.20)
where p is the filter order and N_k is the number of samples in the frame.
The excitation signal is constructed either by adding the fitted LF and aspiration noise estimates, lf(n) and an(n), or by simply generating a Gaussian noise source, gn(n), in voiced (V) and unvoiced (U) segments respectively

e_k(n) = lf_k(n) + an_k(n)  if k ∈ V,   e_k(n) = gn_k(n)  if k ∈ U.   (3.21)
In practice, since the described JEAS analysis is done independently for each frame, the
continuity of the estimated parameters between adjacent frames is not guaranteed, particularly
within voiced segments. As a result, perceptual artifacts are sometimes produced when the
parameters change too abruptly from frame to frame. To reduce this problem, the voiced glottal
source and vocal tract parameter trajectories are smoothed before resynthesis.
Regarding the vocal tract, the jointly estimated filter coefficients (α_1 ... α_p) are first converted to LSFs due to their better interpolation properties. Then, each set of the LSF coefficients LSF^p is averaged with those of the previous and following frames to obtain a smoother vocal tract filter estimate for synthesis

LSF^p_k = ( Σ_{i=k-1}^{k+1} LSF^p_i ) / 3 .   (3.22)
As for the glottal source, a similar approach is followed. First, the fitted LF timing parameters (T_p, T_e, T_a, T_c) are converted to R-parameters (R_g, R_k, R_a), which are more suitable for interpolation since they are normalised with respect to the fundamental period. Again, in order to smooth their trajectories, each R-parameter set is averaged with the ones of the previous and next frames. Aspiration noise energy ANE trajectories are also smoothed the same way

R_k = ( Σ_{i=k-1}^{k+1} R_i ) / 3 .   (3.23)
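The three-frame averaging of equations (3.22) and (3.23) amounts to a simple moving average over the parameter trajectories; a possible sketch, with illustrative names, is:

import numpy as np

def smooth_trajectory(params):
    """Three-frame moving average over a (frames x dims) array of LSF,
    R-parameter or ANE values; the first and last frames are left unchanged."""
    params = np.asarray(params, dtype=float)
    smoothed = params.copy()
    smoothed[1:-1] = (params[:-2] + params[1:-1] + params[2:]) / 3.0
    return smoothed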
Once the source parameters (R_g, R_k, R_a, ANE) have been averaged, they are used to recompute smoothed LF derivative glottal waveforms lf(n) and amplitude modulated aspiration noise estimates an(n) to be used as the filter excitation e(n) for resynthesis.
In order to synthesise the speech wave, the overlap-add scheme of equation (3.24) is employed

s(n) = Σ_{k=1}^{K} w_k(n - k N^k_sc) sc_k(n - k N^k_sc) ,   (3.24)

where K is the total number of frames, w_k is a Hamming window such that

w_k(n) = 0.54 - 0.46 cos(2π n / N^k_sc)  for 0 ≤ n ≤ N^k_sc,  and 0 otherwise,   (3.25)

and sc_k is a synthetic contribution of length N^k_sc = N_{k-1} + N_k generated by

sc_k(n) = e_k(n - N_{k-1}) + Σ_{i=1}^{p} α_i sc_k(n - i) ,   n = 1 ... N^k_sc ,   (3.26)

so that a k-th synthesis frame of N_k samples is obtained as

s(n + k N_k) = w_{k-1}(n + N_k) sc_{k-1}(n + N_k) + w_k(n) sc_k(n) .   (3.27)
Figure 3.17 shows a schematic diagram of the employed overlap-add synthesis scheme.
Figure 3.17 Overlap-Add Synthesis
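The following simplified sketch illustrates the overlap-add synthesis of equations (3.24)-(3.27). It assumes per-frame excitations of roughly two pitch periods and jointly estimated filter coefficients per frame, and it places each windowed synthetic contribution at its frame start, which is a simplification of the exact indexing in equation (3.27); all names are illustrative.

import numpy as np
from scipy.signal import lfilter

def jeas_overlap_add(excitations, coeffs, frame_lens):
    """excitations[k]: excitation of frame k (about N_{k-1}+N_k samples),
    coeffs[k]: all-pole coefficients alpha_1..alpha_p, frame_lens[k]: N_k."""
    starts = np.concatenate(([0], np.cumsum(frame_lens[:-1])))
    out = np.zeros(int(np.sum(frame_lens)) + int(max(frame_lens)))
    for e_k, a_k, start in zip(excitations, coeffs, starts):
        # synthetic contribution: all-pole filtering with A(z) = 1 - sum alpha_i z^-i
        sc_k = lfilter([1.0], np.concatenate(([1.0], -np.asarray(a_k, dtype=float))), e_k)
        n = np.arange(len(sc_k))
        w_k = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / len(sc_k))   # Hamming window
        out[start:start + len(sc_k)] += w_k * sc_k                # overlap and add
    return out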
JEAS SYNTHESIS ALGORITHM
Input:  {α_1, α_2, ..., α_p}, {E_e, T_p, T_e, T_a, T_c, ANE} in Voiced (V) frames
        {α_1, α_2, ..., α_p}, GNE in Unvoiced (U) frames
Output: s(n)
// convert the vocal tract filter coefficients to LSFs
1  LSF = {lsf_1, lsf_2, ..., lsf_p} ← {α_1, α_2, ..., α_p}
// convert the voice source T-parameters to R-parameters
2  R = {R_g, R_k, R_a} ← {T_p, T_e, T_a, T_c}
// smooth LSF, R and ANE trajectories
3  LSF'_k = Σ_{i=k-1}^{k+1} LSF_i / 3 ;  R'_k = Σ_{i=k-1}^{k+1} R_i / 3 ;  ANE'_k = Σ_{i=k-1}^{k+1} ANE_i / 3
// convert the smoothed LSF and R-parameters back to filter and T-parameters
4  {α'_1, α'_2, ..., α'_p} ← LSF' ;  {T'_p, T'_e, T'_a, T'_c} ← R'
5  foreach frame k do
     // construct the excitation signal
6    if frame is V
       // generate the smoothed LF wave
7      lf_k(n) ← {T'_p, T'_e, T'_a, T'_c}
       // generate the smoothed aspiration noise
8      an_k(n) ← ANE'
       // add the smoothed LF wave and aspiration noise components
       e_k(n) = lf_k(n) + an_k(n)
9    elseif frame is U
       // generate the Gaussian noise source
10     e_k(n) = gn_k(n)
     end
     // generate the synthetic contribution
11   sc_k(n) = e_k(n - N_{k-1}) + Σ_{i=1}^{p} α'_i sc_k(n - i) ,   n = 1 ... N^k_sc
     // Hamming window, Overlap and Add
12   s(n) = Σ_{k=1}^{K} w_k(n - k N^k_sc) sc_k(n - k N^k_sc)
   end
Summary of the JEAS Synthesis Algorithm
3.4 Pitch and Time-Scale Modification
Due to the explicit and independent modelling of the fundamental period and the interpolation capabilities of the employed vocal tract and glottal source parameterisations, pitch and time-scale modifications are easily implemented within the JEAS framework.
Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using cubic spline interpolation [96]. Because JEAS modelling is pitch-synchronous, the frame sizes correspond with the pitch periods in voiced segments while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to aspiration ANE and Gaussian GNE noise energies.
Time-scale modification is carried out by increasing or decreasing the number of frames per segment and interpolating the parameter tracks accordingly. For example, in order to increase the duration of a voiced segment of f frames by 25%, the modified number of frames is calculated as mf = f + 0.25f. Then, the f-point pitch period contour is resampled at the new set of uniformly spaced mf points as shown in Figure 3.18. This way, the contour of the fundamental period, i.e. the intonation, is preserved while its variation is slowed down. The same resampling needs to be applied to each of the LSF coefficient, R-parameter and ANE tracks to synthesise time-modified speech. Unvoiced segments can also be time-scaled using the described procedure. In this case, the excitation parameter trajectories to resample are the energies of the Gaussian noise source GNE.
Figure 3.18 Resampling the frame size contour (frame size in samples against frames, showing the original f-point contour and the contour resampled at mf = f + 0.25f points)
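A possible sketch of this contour resampling, using cubic spline interpolation as in the text (names are illustrative; a cubic spline requires at least four frames):

import numpy as np
from scipy.interpolate import interp1d

def resample_contour(contour, mf):
    """Resample an f-point per-frame contour (pitch periods, LSFs, R-parameters or
    noise energies) at mf uniformly spaced points using cubic spline interpolation."""
    contour = np.asarray(contour, dtype=float)
    x_old = np.linspace(0.0, 1.0, contour.shape[0])
    x_new = np.linspace(0.0, 1.0, int(mf))
    return interp1d(x_old, contour, kind='cubic', axis=0)(x_new)

# e.g. lengthening a voiced segment of f frames by 25%: mf = f + 0.25 * f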
Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
For example, if a given pitch period contour of f frames T = {T_1, T_2, ..., T_f} is multiplied by 0.5, speech synthesised with the modified contour T' = 0.5 T = {T'_1, T'_2, ..., T'_f} would be perceived to have twice the original fundamental frequency. However, its duration would also be perceived to be half the original. Scaling the fundamental periods involves modifying the frame sizes and, as a consequence, the segment durations. For this reason, the number of frames in a segment also needs to be modified when scaling pitch if its duration is to be maintained. The modified number of frames mf at the scaled fundamental periods whose duration approximates the original can be calculated as

mf = f + ((T̄ - T̄') / T̄') f ,   (3.28)

where T̄ is the original mean fundamental period and T̄' is the scaled mean fundamental period. Once mf has been calculated, the scaled pitch period contour T', LSF coefficients, R-parameters and ANE trajectories must be resampled at the new number of frames before resynthesising the pitch-modified speech wave.
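Equation (3.28) can be illustrated with a short sketch that computes the modified number of frames for a given scaling factor of the fundamental period contour (names are illustrative):

import numpy as np

def frames_for_pitch_scale(pitch_periods, beta):
    """Eq. (3.28): number of frames mf that keeps the segment duration roughly
    unchanged when the fundamental period contour is scaled by beta (e.g. 0.5 doubles F0)."""
    f = len(pitch_periods)
    t_mean = float(np.mean(pitch_periods))        # original mean fundamental period
    t_mean_scaled = beta * t_mean                 # scaled mean fundamental period
    return int(round(f + ((t_mean - t_mean_scaled) / t_mean_scaled) * f))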
4
JEAS Voice Source and CART Duration Modelling for VC
In this chapter, the use of JEAS glottal source parameterisation and linear transformation is
explored for voice source conversion and the performance of the resulting JEAS VC framework
is compared against that of a state-of-the-art Sinusoidal VC system. The first two sections detail
the speech models and feature transformation techniques employed in each VC implementation.
Objective measurement of their spectral envelope and voice source conversion performance and
subjective evaluation of the recognizability and quality of the converted outputs is presented
next. In the final section, a CART duration modelling and conversion approach is proposed and
discussed.
4.1 Sinusoidal Voice Conversion
The voice morphing system developed by Hui Ye [106, 104, 107] has been used as a baseline
implementation. It employs a Pitch-Synchronous Harmonic Model (PSHM) to represent and
manipulate the speech signal and linear transformations to convert spectral envelopes. In ad-
dition, it also applies a method to transform residuals and includes a technique to mitigate the
artifacts caused by the unnatural phase dispersion produced when only sinusoidal amplitudes
are modified. These are further detailed in the next sections.
4.1.1 PSHM
The PSHM approximates each frame, whether voiced or unvoiced, of the speech signal s_k(n) by the following summation of harmonics

ŝ_k(n) = Σ_{l=0}^{L_k} A^k_l cos(l ω^k_0 n + φ^k_l) ,   n = 0, ..., N_k ,   (4.1)

where ŝ_k(n) is the k-th synthetic speech frame, ω^k_0 is the fundamental frequency of the k-th frame in radians and A^k_l and φ^k_l are the harmonic amplitudes and phases respectively. L_k is the number of harmonics in each frame, which is determined by the integer part of π/ω^k_0. A constant frame rate of 100 Hz is used to replace the pitch values in unvoiced frames, which are then treated equally as voiced frames in the analysis synthesis process.
The sinusoidal parameters are estimated using an analysis by synthesis procedure. The speech signal is divided into small overlapping frames consisting of two pitch periods, convolved with a trapezoidal analysis window. In order to reduce estimation errors, the pitch marks are adjusted to be positioned at the closest zero-crossing points. The optimal amplitudes and phases are then calculated using an iterative algorithm which minimises the windowed least square error (LSE) of equation (4.2) by optimizing one harmonic at a time

E = Σ_n w_k(n) [s_k(n) - ŝ_k(n)]^2 .   (4.2)
Details of its implementation can be found in [104].
The harmonics estimated using the iterative routine can closely represent the original speech
signal at the center pitch period. For synthesis, overlap-add is employed to reduce the disconti-
nuities at frame boundaries. Complementary trapezoidal windows and an overlapping of 1/10th
of a frame are used.
The pitch-synchronous nature of the model is exploited for time and pitch-scale modification. Time-scale transformations are obtained by arbitrarily copying or deleting synthesis frames to achieve the desired duration. Pitch modification is done following the ABS/OLA phasor in-
terpolation approach [39], which generates a pitch-shifted set of excitation components by
resampling the excitation spectrum at the modied harmonic frequencies. However, unlike the
ABS/OLA implementation, pulse alignment is not required after PSHM pitch transformation
since the pitch-synchronous nature of the model ensures that pitch pulses are well aligned.
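As a small illustration of the harmonic summation of equation (4.1), a single PSHM frame could be generated as follows (array names are illustrative only):

import numpy as np

def pshm_frame(amps, phases, w0, n_samples):
    """Synthesise one PSHM frame as a sum of harmonics; amps and phases hold the
    A_l and phi_l of the L+1 harmonics, w0 the fundamental in radians per sample."""
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)
    n = np.arange(n_samples)
    l = np.arange(len(amps))[:, None]            # harmonic index 0..L
    return np.sum(amps[:, None] * np.cos(l * w0 * n[None, :] + phases[:, None]), axis=0)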
4.1.2 Spectral Envelope Conversion
The overall shape of the speech spectrum is captured by spectral envelopes encoded using line
spectral frequencies (LSF). LSF envelopes are estimated for each speech frame as follows. First,
the PSHM harmonic amplitudes A^k_l are used to represent the magnitude spectrum. This is then resampled according to the Bark scale using cubic spline interpolation [96]. Once the magnitude spectrum has been warped, LP coefficients are computed by applying the Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Finally, the estimated LP coefficients are converted into LSF parameters.
The linear transformation function of equation (2.20), trained with an LSE estimation criterion using parallel source and target data, is employed to convert the LSF spectral envelopes. After conversion, the new LSF parameters are transformed back to LP coefficients for synthesis. Because the use of linear transformations broadens the formants of the converted speech, the perceptual post-filter described in equation (2.22) is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
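The envelope estimation pipeline described above might be sketched as follows; the Bark mapping and the Toeplitz solve below are stand-ins for the exact warping and Levinson-Durbin recursion used in the implementation, and the final conversion to LSFs is omitted.

import numpy as np
from scipy.interpolate import interp1d
from scipy.linalg import solve_toeplitz

def bark(f_hz):
    # one common Bark approximation (an assumption; the text does not give the mapping)
    return 6.0 * np.arcsinh(np.asarray(f_hz, dtype=float) / 600.0)

def warped_lp_envelope(freqs_hz, power_spectrum, order=30, n_points=512):
    """Resample the power spectrum on the Bark scale with a cubic spline, take its
    autocorrelation and solve the LP normal equations for the all-pole coefficients."""
    b = bark(freqs_hz)
    warped = interp1d(b, power_spectrum, kind='cubic')(np.linspace(b[0], b[-1], n_points))
    r = np.fft.irfft(warped)[: order + 1]           # autocorrelation of the warped spectrum
    a = solve_toeplitz(r[:order], r[1:order + 1])   # prediction coefficients alpha_1..alpha_p
    return a                                        # conversion to LSFs (not shown) follows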
4.1.3 Spectral Residual Prediction
Residuals are converted using a codebook based spectral residual prediction method, which
exploits the correlation existing between residuals and spectral envelopes.
First, a GMM model is trained to cluster the target LSF spectral envelopes into M classes (C_1, ..., C_M). For each target envelope y_t the posterior probability P(C_m|y_t) of y_t belonging to a particular class m is given by

P(C_m|y_t) = α_m N(y_t; μ_m, Σ_m) / Σ_{i=1}^{M} α_i N(y_t; μ_i, Σ_i) ,   (4.3)

where α_m, μ_m and Σ_m are the weights, means and variances of the GMM components respectively and N(·) represents the Normal Distribution.
The log spectral residual r_t associated with each target LSF training vector y_t is calculated as

r_t = 20 log_10 H(t)_sin - 20 log_10 H(t)_env ,   (4.4)

where H(t)_sin are the harmonic magnitude spectrum components of frame t and H(t)_env corresponds to the LSF spectral envelope. So that the residual spectra are normalised with respect to F_0, they are all resampled to 100 points.
The resampled target residuals are then used to build an M-entry spectral residual codebook R = [R_1, R_2, ..., R_M] which is used to generate the target least square error (LSE) residual prediction r̂ as a linear function of the posterior probability vector P(y) = [P(C_1|y), ..., P(C_M|y)]^T for a particular spectral envelope y, that is

r̂ = R P(y) .   (4.5)

If the least square prediction error on the training data is defined as

ε = Σ_{t=1}^{T} (r_t - R P(y_t))^T (r_t - R P(y_t)) ,   (4.6)

then R can be estimated by

R = ( Σ_{t=1}^{T} r_t P(y_t)^T ) ( Σ_{t=1}^{T} P(y_t) P(y_t)^T )^{-1} .   (4.7)

During conversion, the optimal LSE spectral residual for a converted spectral envelope is simply obtained by applying equation (4.5).
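Equations (4.5) and (4.7) translate directly into a few lines of linear algebra; the sketch below assumes the resampled log residuals and GMM posteriors have already been computed for the T training frames (names are illustrative).

import numpy as np

def train_residual_codebook(residuals, posteriors):
    """Eq. (4.7): residuals is a T x D matrix of 100-point log residuals,
    posteriors a T x M matrix of class posteriors P(C_m | y_t)."""
    A = residuals.T @ posteriors                  # D x M, sum_t r_t P(y_t)^T
    B = posteriors.T @ posteriors                 # M x M, sum_t P(y_t) P(y_t)^T
    return A @ np.linalg.inv(B)                   # D x M codebook R

def predict_residual(R, posterior):
    """Eq. (4.5): predicted residual for one (converted) envelope's posterior vector."""
    return R @ posterior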
4.1.4 Phase Prediction
Because the above described Spectral Envelope Conversion and Spectral Residual Prediction methods only modify spectral magnitudes, the transformed speech samples will include artifacts caused by the mismatch between the modified magnitudes and the original phases. Ye's voice conversion system employs a Phase Prediction approach similar to that used for Spectral Residual Prediction to mitigate this problem.
Based on the correlations between the spectral envelopes and the waveform shape, the employed phase prediction technique builds a waveform codebook T = [T_1, ..., T_M] which can predict the target waveform shape ŝ(t) given a particular spectral envelope y as

ŝ(t) = T P(y) .   (4.8)

The waveform codebook is chosen to minimise the following LSE over the training data

E = Σ_{t=1}^{N} (s(t) - T P(y_t))^T (s(t) - T P(y_t)) ,   (4.9)

where s(t) is the t-th speech frame in the training data normalised to a pitch of 100 Hz.
The standard solution to equation (4.9) is then used to estimate T

T = ( Σ_{t=1}^{N} s(t) P(y_t)^T ) ( Σ_{t=1}^{N} P(y_t) P(y_t)^T )^{-1} .   (4.10)
Once T has been calculated, the waveform shape of a converted spectral envelope is pre-
dicted via equation (4.8). Then, by exploiting the correlation between the waveform shape and
phase dispersion, the phases of the predicted waveform shape are substituted after conversion
to avoid the artifacts caused by the mismatched original source phases.
4.2 JEAS Voice Conversion
The spectral envelope and glottal waveform transformation methods employed within JEAS
voice conversion are described in the following two sections. While spectral envelope conversion
is done in a way very similar to the above sinusoidal voice conversion implementation, the main
advantage of JEAS Modelling, i.e. the parameterisation of the voice source, allows the source
characteristics to be also transformed to match the target. As well as offering the potential for
improved fidelity in the target identity, this also avoids the need for the previously described
residual prediction methods. In addition, because the JEAS parameterisation does not involve
a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and
phase mismatches are not produced and, thus, the use of phase prediction is not required.
4.2.1 Spectral Envelope Conversion
As in the sinusoidal voice morphing implementation, the jointly estimated JEAS all-pole vocal tract filter coefficients {α_1, ..., α_p} are converted to Bark scaled LSF parameters for the transformation of the JEAS spectral envelopes. First, the linear frequency response of the jointly estimated vocal tract filter is calculated. Again, this is resampled according to the Bark scale using cubic spline interpolation [96] and the warped all-pole filter coefficients are computed by applying the Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. The filter coefficients are finally transformed into LSF for conversion.
Linear transformations and post-filtering are also employed to convert the JEAS LSF spectral envelopes, which are then changed into all-pole filter coefficients and resampled back to the linear scale before synthesis.
Thus, the spectral envelope conversion method employed in both the Sinusoidal and the
JEAS implementations is essentially the same, the only difference lying in the particularities of
the sinusoidal and JEAS spectral envelopes. This is illustrated in Figure 4.1, where it can be
seen that the PSHM envelopes capture the spectral tilt but in JEAS, it is encoded by the glottal
waveforms instead. In addition, whilst both methods manage to represent the most important
formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
Figure 4.1 JEAS vs. PSHM spectral envelopes
4.2.2 Glottal Waveform Conversion
Previous work on glottal waveform conversion has demonstrated that the quantization of glottal
parameters is possible and capable of capturing voice source quality differences. For exam-
ple, Childers et al. [16] built 32-entry codebooks of polynomial voice source parameters from
sentences produced with different voice qualities and managed to achieve conversions between
modal, vocal fry, breathy, rough, falsetto, whisper and hoarse phonations. However, experiments
involving transformations between more similar phonations, i.e. different modal speakers, or
alternative conversion methods have not been explored yet. The use of LF glottal parameterisa-
tions has not been investigated either.
The glottal waveform morphing approach adopted within JEAS voice conversion employs lin-
ear transformations to map glottal LF parameters of different modal male and female speakers,
which are the most commonly used speaker types in voice conversion applications.
Linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes. The limitations of the codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms. Thus,
glottal conversions too.
The feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength E_e and timing parameters (T_p, T_e, T_a, T_c) obtained from the LF fitting procedure and the energy of the aspiration noise estimate ANE used to adjust that of the modelled pitch-synchronous amplitude modulated Gaussian noise. In order to normalise the T_0 dependent T-parameters for conversion, they are transformed into R-parameters (R_g, R_k, R_a), resulting in the five-dimensional feature vector (E_e, R_g, R_k, R_a, ANE) for glottal waveform conversion.
As shown in Figure 4.2, the described glottal conversion approach is capable of bringing
the source feature vector parameter contours closer to the target which, as a consequence, also
produces converted glottal waveforms more similar to the target.
Figure 4.2 Linear Transformation of LF Glottal Waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters (E_e, R_g, R_k, R_a, N_e)
JEAS VC ALGORITHM
Input: JEAS-analysed training and test data
Output: JEAS-converted test data
1  foreach training frame do
     // convert the linear vocal tract filter coefficients to Bark-scaled LSFs
2    LSF_B ← {α_1, α_2, ..., α_p}
     // convert the glottal source T-parameters to R-parameters
3    {R_g, R_k, R_a} ← {T_p, T_e, T_a, T_c}
     // define the glottal feature vector G
4    G = {E_e, R_g, R_k, R_a, ANE}
   end
// train the Spectral Envelope conversion function
5  F(LSF_B) ← LSE, parallel source and target training data
// train the Glottal Waveform conversion function
6  F(G) ← LSE, parallel source and target training data
7  foreach test frame do
     // convert the linear vocal tract filter coefficients to Bark-scaled LSFs
8    LSF_B ← {α_1, α_2, ..., α_p}
     // transform the Spectral Envelope
9    LSF'_B ← F(LSF_B)
     // convert the transformed Bark-scaled LSFs back to linear filter coefficients
10   A' = {α'_1, α'_2, ..., α'_p} ← LSF'_B
     // apply Post-Filtering
12   A'' = {α''_1, α''_2, ..., α''_p} ← H'(z) = A'(z/1) / A'(z/0.94)
     // convert the glottal source T-parameters to R-parameters
13   {R_g, R_k, R_a} ← {T_p, T_e, T_a, T_c}
     // define the glottal feature vector G
14   G = {E_e, R_g, R_k, R_a, ANE}
     // transform the Glottal Waveform
15   G' ← F(G)
     // convert the transformed glottal source R-parameters back to T-parameters
16   {T'_p, T'_e, T'_a, T'_c} ← {R'_g, R'_k, R'_a}
   end
Summary of the JEAS VC Algorithm
4.3 PSHM vs. JEAS Voice Conversion
The performance of the described PSHM and JEAS systems has been evaluated in a conver-
sion task based on the VOICES database [48]. Specifically designed for voice conversion pur-
poses, the corpus is composed of 3 instances of 50 phonetically rich sentences spoken by 10
speakers (5 male, 5 female), i.e. a total of 150 utterances per speaker. The speech data was
recorded at a 22kHz sampling rate using a mimicking approach which resulted in a natural
time-alignment between the identical sentences produced by the different speakers and factored
out the prosodic cues of speaker identity to some extent. Glottal closure instants derived from
laryngograph signals are also provided for each sentence, and have been used for both PSHM
and JEAS pitch synchronous analysis. Four different voice conversion experiments have been
investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-
female (FF) transformations. The first 120 sentences are used for training and the remaining 30
for testing each speaker pair conversion.
LSF spectral vectors of order 30 have been employed throughout the conversion experiments,
to train 8 linear spectral envelope transforms between each source and target speaker pair using
the parallel VOICES training data. This number has been chosen for being capable of achieving
small spectral distortion ratios while still generalising to the test data [104]. Aligned source-
target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries
and using Dynamic Time Warping to further constrain their time alignment. For residual and
phase prediction, target GMMs and codebooks of 40 classes and entries have been built. Finally,
glottal waveform conversions have also been carried out using 8 linear transforms per speaker
pair. Objective and subjective evaluations have been used to compare the performance of the
two methods.
4.3.1 Objective Evaluation
4.3.1.1 Spectral Envelope Conversion
Because the linear spectral envelope transformations are actually applied to LSF vectors, their conversion performance can be easily evaluated by comparing source, target and converted LSF vector distances. If the distance between two LSF vectors lsf_1 and lsf_2 is defined as

D_LSF(lsf_1, lsf_2) = |lsf_1 - lsf_2| = sqrt( (lsf_1 - lsf_2)^T (lsf_1 - lsf_2) ) ,   (4.11)

the following distortion ratio R_LSF can be used as an objective measure of how close the source vectors have been converted into the target

R_LSF = ( Σ_{t=1}^{L} D_LSF(lsf_conv(t), lsf_tgt(t)) / Σ_{t=1}^{L} D_LSF(lsf_src(t), lsf_tgt(t)) ) × 100 ,   (4.12)
Figure 4.3 R_LSF distortion ratios of the converted PSHM and JEAS spectral envelopes
where lsf_src(t), lsf_tgt(t) and lsf_conv(t) are the source, target and converted LSF vectors respectively and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
R_LSF ratios have been computed for the PSHM and JEAS spectral envelope conversions on the VOICES test set. Figure 4.3 shows the obtained results. Although the differences are small, JEAS has been found to perform slightly better than PSHM, with LSF distortion ratios 3% smaller in all conversion tasks overall. This might be due to the fact that JEAS spectral envelopes do not encode spectral tilt information, which reduces the LSF variations caused by tilt differences, resulting in more accurate mappings.
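The distortion ratio of equations (4.11) and (4.12) can be computed with a few lines over time-aligned source, target and converted LSF matrices (rows = frames), for example:

import numpy as np

def r_lsf(lsf_src, lsf_tgt, lsf_conv):
    """R_LSF of eq. (4.12); 100% means no improvement over the unconverted source."""
    d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b), axis=1)  # eq. (4.11)
    return 100.0 * d(lsf_conv, lsf_tgt).sum() / d(lsf_src, lsf_tgt).sum()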
4.3.1.2 Voice Source Conversion
Similar objective distortion measures can also be used to evaluate the conversion of the voice
source characteristics, i.e. Residual Prediction and Glottal Waveform Conversion in the PSHM
and JEAS implementations respectively.
Residual Prediction reintroduces the target spectral details not captured by spectral envelope
conversion, bringing as a result the converted speech spectra closer to the target. Glottal Wave-
form Conversion, on the other hand, maps time-domain representations of the glottal waveforms
which in the frequency domain result in better matching glottal formants and spectral tilts of
the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e. they aim to
reduce the differences between the converted and the target speech spectra.
One way to evaluate if the voice source conversion methods achieve the desired effect is to
measure the log spectral distances (LSD) between the converted and target spectra before and
after voice source conversion. If the RMS log spectral distance between two spectra is defined
as
Figure 4.4 R_LSD distortion ratios of Residual Predicted (RP) and Glottal Waveform Converted (GWC) spectra
D_LSD(S_1, S_2) = sqrt( (1/K) Σ_{k=1}^{K} (10 log_10(s_{1,k}) - 10 log_10(s_{2,k}))^2 ) ,   (4.13)

where {s_k} are the harmonic amplitudes resampled from power spectrum S at K points on the Bark frequency scale (K has been set to 100 points in this work). Then, a distortion ratio R_LSD similar to R_LSF can be used to compare the converted-to-target log spectral distances with and without voice source conversion

R_LSD = ( Σ_{t=1}^{L} D_LSD(S_conv(t), S_tgt(t)) / Σ_{t=1}^{L} D_LSD(S_orig(t), S_tgt(t)) ) × 100 ,   (4.14)

where S_conv(t) and S_orig(t) are the converted spectra with and without voice source conversion respectively and S_tgt(t) is the target spectrum. Thus, a 100% ratio corresponds to the distortion between spectral envelope converted spectra without voice source transformation and the target spectra.
Figure 4.4 illustrates R_LSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented in R_LSD. In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
Figure 4.5 Results of the ABX test
4.3.2 Subjective Evaluation
In order to compare the PSHM and JEAS voice conversion systems perceptually, a listening test
was carried out to check their performance in terms of recognizability and quality. 12 subjects
took part in the perceptual study, which consisted of two parts.
The rst part was an ABX test in which subjects were presented with PSHM-converted (A),
JEAS-converted (B) and target (X) utterances and were asked to choose the speech sample A or
B they found sounded more like the target X in terms of speaker identity. Spectral envelopes and
voice source characteristics were transformed with the methods described above for each system,
i.e. spectral envelope conversion, residual and phase prediction were used for PSHM transfor-
mations and spectral envelope and glottal waveform conversion for JEAS transformations. In
addition, the prosody of the target was employed to synthesise the converted sentences in or-
der to normalise the pitch, duration and energy differences between source and target speakers
for the perceptual comparison. 10 utterances of each conversion type (MM, MF, FM, FF) were
presented. The order of the samples in terms of conversion type and conversion system was
randomised. Informal listening of the utterances transformed using the PSHM and JEAS con-
version systems revealed that it was often very difficult to convincingly choose between systems in terms of speaker identity. For this reason, subjects were also allowed to select a NO STRONG PREFERENCE option when they found it difficult to choose or did not have a strong preference
towards one of the presented A or B speech samples.
Figure 4.5 shows the results of the ABX test. In all conversion types, the JEAS-converted sam-
ples are preferred over the PSHM-converted ones overall, but the preference difference varies
depending on the type of conversion, being for example almost the same for FM transforma-
tions. However, the NO STRONG PREFERENCE (NSP) option has been selected almost as often
as the JEAS-converted utterances in general, which reveals that subjects found it really difficult
to distinguish between conversion systems in terms of speaker identity. Because the most impor-
tant speaker identifying cues, i.e. spectral envelopes, are transformed using the same method
Figure 4.6 Results of the quality comparison test
in the two conversion implementations, it is expected that both systems should perform equally
in terms of speaker recognizability. In addition, the obtained results show that the Residual Pre-
diction and Glottal Waveform Conversion techniques are also comparable in terms of perceptual
speaker identity transformation.
The second listening test aimed at determining which system produces speech with a higher
quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and
asked to choose the one they thought had a better speech quality. Results are illustrated in Figure
4.6. There is a clear preference for the sentences converted using the JEAS method, chosen
75.7% of the time on average, which stems from the clearly distinguishable quality difference
between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion
have a noisy quality caused by phase discontinuities which still exist despite Phase Prediction.
Comparatively, JEAS converted sentences sound much smoother. This quality difference is also
thought to have slightly biased the preference for JEAS conversion in the ABX test.
4.4 Duration Conversion
Whilst spectral envelope and voice source transformations are capable of generating converted
speech which is closer to the target in terms of recognizability, there is still a considerable amount
of source speaker identity present in the prosody of the converted samples. While the mean and
standard deviation of the fundamental frequency F_0 are generally adjusted to match those of
the target, the source duration characteristics are mostly kept unchanged. In this section, the
use of decision trees is investigated for duration conversion.
Intrinsic phone durations have been shown to be speaker specific [73]. In addition, duration contours also differ among speakers due to accent, dialect, emotional state or language impairment. These factors result in highly variable, speaker and context dependent average phone durations, which makes their modelling a difficult task.
Most work to date on duration modelling has been done in the context of text-to-speech
(TTS) synthesis. Early TTS implementations mainly employed rule-based duration models. De-
spite their reasonable performance, such models often over-generalise and cannot handle ex-
ceptions without becoming too complicated. For these reasons, computational progress and
availability of large speech corpora has favoured the development and increased use of data
driven approaches in state-of-the-art applications. Among these, classification and regression
trees (CART) [79], additive/multiplicative linear statistical models [99] or stochastic neural
networks and HMMs [13] have been increasingly used to model duration in TTS systems.
However, to the author's knowledge, none of these duration models has yet been applied to
convert durations in voice conversion systems. The only duration conversion attempt follows a
weighted VQ transformation approach, which exploits the phonetic codebooks and weights de-
veloped for the conversion of spectral envelopes to modify the mean value of the speech frame
durations. Whilst this method achieves more natural and intelligible outputs than uniformly
modifying the speaking rate of the source to match that of the target, it cannot transform the
original speech rhythm. The use of a duration model capable of capturing the rhythmic differ-
ences of the source and target speakers is therefore expected to improve the duration conversion
performance.
The basic idea proposed here for duration conversion is to use the durations predicted by a
CART tree trained on target data instead of the original source ones. We have chosen to explore
the use of CART trees for duration conversion for two main reasons: because standard tools for
their generation exist and because the derived trees can be interpreted and used to determine
the most relevant features.
CART trees consist of nodes, arcs and leaves. In each node a question is asked to find out whether a particular tree feature satisfies a given condition. Features can be continuous or discrete. Each arc to a child represents a possible value of a feature. Each leaf represents a possible value of the target variable given the values of the features on the path from the root. CART trees can be learned by recursively splitting a training set into subsets based on feature value tests until a stopping condition is fulfilled.
Within TTS implementations [79, 99, 13], features such as segment identity, identity of the preceding and following segments, position of the segment within a syllable, word and/or utterance, stress and accent are generally considered for influencing segmental duration. In this work, phonemes have been chosen as segmental units and only the simplest of the mentioned features have been explored, due to restrictions imposed by the limited amount of available training data. Details of the employed features are as follows:
Phone identity: one of the 46 phones of the adopted British English BEEP dictionary phone set, e.g. /ae/, /jh/, /zh/
Previous Phone identity: mapped into one of the following broad classes: V (vowel), L (liquid), S (stop), F (fricative), A (affricate), N (nasal) and SIL (silence)
Next Phone identity: also mapped into a broad class, i.e. V, L, S, F, A, N, SIL
Position of the Phone in the sentence: each sentence is divided into three segments, i.e. initial, middle and final
Taking these features into account, three decision trees {T1, T2, T3} were built for each
speaker, using the Matlab implementation of CART [65]. While T1 was trained only taking
phone identity into account, T2 also included context (previous and next) phone identity fea-
tures mapped into the seven broad classes defined above. T3 added the position of the phone
in the sentence to the features used by T2. The more features are used for training, the more
complex the trees become. As a result, T1 only has 2 levels, while the number of levels goes up to 10 in T2 and reaches 14 in T3. Visual inspection of the trees revealed that the most important feature which
contributes to the prediction of phone durations is the identity of the phones themselves, always
appearing in the highest levels.
Figure 4.7 shows a partial example of T3. The tree assigns different durations for phone /z/
when it occurs in different contexts and sentence positions. A duration value of 59 ms is assigned
when the previous phone is a vowel, liquid or nasal. A duration value of 95 ms is assigned when
it satises the following criteria: it occurs in the middle of the sentence and the previous phone
is not a vowel, liquid or nasal. A duration value of 145 ms is assigned when its position in the
sentence is initial or nal and the previous phone is either plosive, fricative, affricate or silence.
Figure 4.7 An example partial CART decision tree for phone duration prediction
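The trees in this work were built with the Matlab implementation of CART [65]; purely as an illustration, an equivalent regression tree over one-hot encoded categorical features could be trained with scikit-learn as sketched below, where the feature rows and duration values are invented examples.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# one row per phone: [phone, previous class, next class, position] -> duration (ms)
X_train = [['z', 'V', 'S', 'middle'],
           ['z', 'F', 'V', 'initial'],
           ['ae', 'S', 'N', 'final']]           # illustrative values only
y_train = [59.0, 145.0, 92.0]

t3 = make_pipeline(OneHotEncoder(handle_unknown='ignore'),
                   DecisionTreeRegressor(min_samples_leaf=2))
t3.fit(X_train, y_train)

# converted duration for one phone of a test utterance
predicted_ms = t3.predict([['z', 'N', 'V', 'middle']])[0]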
The VOICES database has been used to test the proposed algorithm in standard voice con-
version tasks. Duration modifications have been performed on top of the JEAS spectral envelope and glottal waveform conversions described in the previous section for MM, MF, FM and FF transformations. Only those sentences unconstrained in terms of timing and intonation have been considered. Because of the resulting limited amount of data available to train the CART tree for each speaker, i.e. approximately 3 min when at least 20 min are generally used to train the CART trees employed for duration modelling in TTS synthesis, the test sentences have also been included during training.
Two objective measurements have been employed to evaluate the duration conversion performance. The first one is a duration distortion ratio, similar to those previously employed for the evaluation of spectral envelope and voice source transformations. If the mean square error distance (MSE) between two duration contours {dur_1, dur_2} is defined as

D_MSE(dur_1, dur_2) = (1/NP) Σ_{ph=1}^{NP} (dur^ph_1 - dur^ph_2)^2 ,   (4.15)

where NP is the total number of phones, the ratio R_MSE of equation (4.16) can then be used to measure how close the converted duration contour is from the target. A ratio of 100% corresponds to the distance between the original source and the target duration contours.
R_MSE = ( Σ_{c=1}^{C} D_MSE(dur_conv(c), dur_tgt(c)) / Σ_{c=1}^{C} D_MSE(dur_src(c), dur_tgt(c)) ) × 100 ,   (4.16)

C being the total number of duration contours employed for evaluation.
The other measure is the correlation (Corr) between duration contours {dur_1, dur_2}, calculated as

Corr(dur_1, dur_2) = cov(dur_1, dur_2) / (σ_dur_1 σ_dur_2) ,   (4.17)

where cov means the covariance and σ_dur_1 and σ_dur_2 are the standard deviations of the duration contours. In order to evaluate the duration conversion algorithm, the Corr between source, target and converted duration contours can be compared.
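The evaluation measures of equations (4.15)-(4.17) can be sketched as follows for a single pair of equal-length phone duration contours; the full R_MSE of equation (4.16) would sum D_MSE over all C test contours.

import numpy as np

def duration_conversion_metrics(dur_src, dur_tgt, dur_conv):
    """D_MSE (eq. 4.15), the single-contour version of R_MSE (eq. 4.16) and Corr (eq. 4.17)."""
    d_mse = lambda a, b: np.mean((np.asarray(a) - np.asarray(b)) ** 2.0)
    r_mse = 100.0 * d_mse(dur_conv, dur_tgt) / d_mse(dur_src, dur_tgt)
    corr = np.corrcoef(dur_conv, dur_tgt)[0, 1]
    return r_mse, corr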
Figures 4.8 and 4.9 show the R_MSE and Corr results for the explored four conversion tasks. As can be seen, the addition of context and positional features reduces the R_MSE distortion ratio and increases the correlation Corr between the converted and target duration contours in all the cases. However, the amount of improvement depends on how similar the source (S) and target (T) duration contours were to start with. As expected, the technique performs worst when the correlation between the S and T contours is very high (≥ .80) and thus, the speakers are originally
very similar in terms of duration characteristics. That is the case, for example, of speaker 1 and
3 of MM and MF conversions respectively. Informal listening revealed that due to the similarity
of the duration patterns of the speakers in the database, duration conversion did not produce
big perceptible differences in most cases.
Whilst these results can only be regarded as a closed set example, they demonstrate the
potential capabilities of the proposed duration conversion method in applications where sufficient target training data is available and the durational differences between source and target speakers are higher. As one such application, its use to repair the duration characteristics of
tracheoesophageal speech is discussed in the next chapter.
Figure 4.8 R_MSE duration distortion ratios
Figure 4.9 Correlation of source-target (ST) and converted-target (T1, T2, T3) duration contours
5
Application: Tracheoesophageal Speech Repair
In this chapter, JEAS modelling and the CART duration transformation technique developed for
VC are applied to repair tracheoesophageal (TE) speech. The concept of laryngectomy and TE
speech are introduced first. Then, the two main limitations of TE speech, i.e. voice source
and duration, are described. The adopted TE voice source and duration repair approaches are
presented next. Finally, the developed repair algorithms are evaluated and the obtained results
are discussed in the last section.
5.1 Laryngectomy
In the UK, around 2,200 people are diagnosed with laryngeal cancer every year [2]. Its causes
are strongly associated with tobacco and alcohol use. Though approximately 80% of cases occur
in men, changes in smoking and drinking habits have increased laryngeal cancer rates in women
in recent years. When diagnosed at an early stage, tumors can be treated with radiotherapy or
laser surgery. However, larger tumors and instances of the recurrence of the disease often make
a total laryngectomy necessary.
Laryngectomy is a surgical procedure which involves the removal of the larynx. The primary functions of the larynx are breathing and protection of the lungs, i.e. closing the airway during swallowing or coughing for clearance of the airways. Voice production is a secondary, not life-saving function of the larynx. It is, nevertheless, a significant function, since it enables,
together with the vocal tract and articulators, oral communication. After laryngectomy, all these
functions of the larynx are lost and need to be substituted. Breathing and protection of the lungs
is achieved with surgery, which results in a complete separation between the pharynx and the
trachea and thus, between the alimentary and respiratory pathways. Voice and speech restora-
tion, however, cannot be achieved with surgery and is hence the focus of rehabilitation after laryngectomy.
The loss of the normal voice is generally considered to be the most obvious consequence
of total laryngectomy. Despite advances in alaryngeal communication, laryngectomee speech
achieved with state-of-the-art voice rehabilitation and restoration techniques has lower quality
and is significantly less intelligible than normal laryngeal speech [83]. As a result, laryngectomees often find oral communication difficult and embarrassing. Activities which were normal and easy before surgery, such as speaking on the phone or chatting to friends in a pub, become difficult tasks with their new speech, often resulting in social self-exclusion. Alternative meth-
ods to enhance laryngectomee speech quality and intelligibility are expected to help improve
their oral communication and, as a consequence, the quality of their lives. The voice source
and duration repair algorithms presented in this chapter constitute the rst steps towards the
development of electronic devices that could help improve the communication of TE speakers
under certain conditions, for example, on the telephone.
5.1.1 Speech Production after Laryngectomy
During laryngectomy, the entire larynx (i.e. vocal cords, epiglottis and tracheal rings) is re-
moved. After surgery, the mucosa overlying the pharyngeal muscles in the esophagus and phar-
ynx serves as a new voice source known as the neoglottis. In order to compensate for the lack of
a valve to separate the respiratory and digestive tracts, the trachea is connected to an opening in
the neck called stoma. Laryngectomees therefore still eat through their mouth but they breathe
through the stoma. In addition, they also have to learn a new speech production technique.
Nowadays, there are three main methods used for voice restoration after total laryngectomy:
esophageal speech, tracheoesophageal speech and electrolaryngeal speech.
Figure 5.1 Speech production apparatus before and after laryngectomy
Esophageal speech
In esophageal speech, air is rst brought from the mouth into the esophagus, and then during
eructation this air is brought back into the mouth, causing vibrations of the neoglottis.
Esophageal speech is often described as a harsh voice of low pitch and loudness. This method
is sometimes compared with belching although the air that creates the voice does not come from
the stomach but from the upper part of the esophagus. The amount of air that can be used for
voice production is very small and therefore, the maximum phonation time of this type of speech
is very short, limiting speech production to short segments and sentences.
Not all laryngectomized patients are able to acquire esophageal speech, and the time re-
quired to learn it varies among speakers from as little as a week to as long as a year. However,
this method of voice rehabilitation does not require any additional surgery or prosthesis, and as
a result is adequate for patients whose anatomy after laryngectomy does not meet the require-
ments for further surgery and implantation of a prosthesis.
Tracheoesophageal speech
For this method of voice restoration, a voice prosthesis is inserted into a surgically created
fistula between the trachea and the esophagus. This voice prosthesis enables the use of exhaled
air from the lungs for voice production. When the stoma is closed, the exhaled pulmonary air
is directed through the voice prosthesis into the esophagus, setting the neoglottis into vibration.
Then, the voice sound produced by the neoglottis is formed into speech by the normal resonators
and articulators. The voice prosthesis itself does not thus generate any voice sound, but only
allows the air from the lungs to enter the esophagus.
Tracheoesophageal voice restoration has often been cited as the alaryngeal speech alternative
most comparable to normal laryngeal speech in quality, fluency and ease of production. It
has a perceptually better quality than esophageal speech, mainly due to its greater speaking
rate and the fluency made possible by the voice prosthesis. It is, in addition, characterized
by a longer phonatory duration and louder voice. Intelligibility is also generally higher for
tracheoesophageal speech than for esophageal speech.
The use of tracheoesophageal speech was initiated with the development of a useful voice prosthesis two decades ago [82]; since then, it has become the most frequently used method of voice restoration following total laryngectomy. The prosthesis acts as a one-way valve,
which enables air passing from the lungs into the esophagus, but prevents leakage of food or
saliva from the esophagus into the lungs. Over the years, several voice prostheses have been
developed. Initial prostheses required manual occlusion of the stoma with a finger or thumb.
Nowadays, tracheostoma breathing valves can be used to achieve hands-free communication.
These valves automatically shunt air into the esophagus for voice production, eliminating the
need for manual interaction.
Electrolaryngeal speech
An electrolarynx is a hand-held device with an electromagnetically vibrating membrane.
When this membrane is held against the skin of the neck or the floor of the mouth, the vibrations at one fixed frequency are transmitted through the skin into the vocal tract, where they are
modulated with the articulatory organs into speech.
Speech produced by an electrolarynx sounds mechanical, robot-like and monotonous. This
is why, although these devices offer rapid acquisition of speech ability and ease of use, most
laryngectomees prefer other alternatives that offer a more natural vocal quality and a capacity
for hands-free communication.
An additional drawback is that it is sometimes difficult to position the electrolarynx in the
neck. Also, in cases of edema or scar tissue of the neck, transmission of the vibrations through
the skin can be very poor. Therefore, in general, an electrolarynx is only used when tracheoe-
sophageal or esophageal speech is not possible.
5.2 Limitations of TE Speech
Since it is the most widely used method of voice rehabilitation after laryngectomy, the research
presented in this chapter focuses on the repair of TE speech. As mentioned before, TE speech
is often regarded as the alaryngeal speech alternative most comparable to normal laryngeal
speech in quality, fluency and ease of production. However, its quality and intelligibility are still
significantly lower than those of laryngeal speech, being perceptually described as more breathy,
rough, low, deep, unsteady and ugly than normal voices [97].
There are two main limitations thought to be responsible for the decreased quality of TE
speech: the inability to properly control the voice source and the duration deviations caused by
the disconnection between the lungs and the vocal tract.
After the removal of the larynx, and thus of the vocal cords, TE speakers have limited voice
production capabilities. It is today well known that after laryngectomy the pharyngoesophageal
segment acts as the new voicing source in substitution of the vocal cords. In addition, some
evidence for TE voice production being an aerodynamic-myoelastic (AM) event similar to normal
voice production has also been found [97]. However, the AM function of the neoglottis differs
amongst speakers, and the inability to properly control it is thought to be an important cause
of the reduced quality of TE speech. This is expected to affect the main speech characteristics
linked to the voice source, i.e. the shape of the glottal wave, the fundamental frequency and
aspiration noise.
Analysis of TE voicing source waveforms obtained by inverse filtering flow functions recorded
with a circumferentially vented mask has shown that they are highly variable and deviant in
comparison with normal patterns [75]. While the fundamental frequency of male TE speech has
been reported to be similar to that of normal voice [97, 80], that of female TE speech has been
found to be lower than normal female speech and not to differ from that of male TE speakers
[97, 93]. In addition, F0 is in general less stable and, as a consequence, TE speech presents
more deviant F0 and intensity perturbation measures, resulting in higher values of jitter and
shimmer [80, 11] and a rough quality. Regarding aspiration noise, it has also been found to be
higher in TE speech, which is reflected in its characteristic breathy quality. Specifically, high-frequency
noise has been found to be higher than normal, while the harmonics-to-noise ratio,
the glottal-to-noise excitation ratio and the band energy difference have been shown to be lower
in TE speech [20].
The surgical disconnection between the lungs and the vocal tract, carried out in order to
cope with the removal of the epiglottis, affects the duration pattern of laryngectomee speech.
Esophageal speakers inject air into the esophagus to replace the pulmonary air used in normal
phonation. However, the amount of air they can store in the esophagus is very small compared
to the capacity of the lungs, which reduces their maximum phonation time making esophageal
speech sound faltering. Thanks to the voice prosthesis, tracheoesophageal speakers can use the
air from the lungs as in normal phonation for speech production, improving the naturalness of
their prosody in comparison to esophageal speech. However, the need to control the voice pros-
thesis in order to switch between speaking and breathing still has consequences in the duration
pattern of TE speech. In general, TE speakers tend to stop more often, produce vowels with
longer duration, speak with slower rates than normal subjects and rush phones when they are
running out of air. Measures of TE speech duration have confirmed that the main differences
compared to normal speech are shorter maximum phonation times, longer vowel durations and slower
speaking rates [80].
The TE speech repair approach presented in this chapter aims at repairing the two main
limitations of TE speech mentioned above, i.e. abnormal voice source and duration,
by applying the JEAS speech model-based glottal source and duration conversion techniques
developed in this thesis.
5.3 Speech Corpora
Thirteen tracheoesophageal (11 male, 2 female) speakers provided the data for the repair experiments.
All of the patients came from the Speech and Language Therapy Department of
Addenbrooke's Hospital in Cambridge (UK) and were referred to this project by the therapist
responsible for their treatment as representative of a wide range of TE vocal qualities.
Sociodemographic and clinical information of the TE speakers participating in the recordings
is shown in Table 5.1. The subjects' ages ranged from 45 to 80, with a mean age of 65 years.
Post-operation time ranged from 10 months to 18 years and 9 months, with a mean of 5 years.
In addition to total laryngectomy, most patients were also treated with radiotherapy. In order
to avoid hypertonicity of the pharyngoesophageal segment, which is considered to be the main
cause of voice failure after laryngectomy, techniques to decrease the tonicity of the neoglottis
were also performed in some patients. Myotomy involves surgically cutting the pharyngeal
muscles to prevent their contraction. Neurectomy, on the other hand, consists of surgically
cutting the pharyngeal nerve branches instead. All these factors, together with the experience
and adaptation of the patients to the use of the valve, influence the quality of the resulting TE
speech.
In addition, a control group of eleven (8 male, 3 female) normal voiced subjects produced
and recorded the same stimuli. These subjects were of similar ages to the subjects in the patient
group.
Recordings were made of each subject producing speech at a comfortable level of pitch and
Pat. no.  Sex  Age  Post-op time (yr;mnth)  Radiotherapy  Myotomy  Neurectomy
1         M    72   18;9                    yes           no       no
2         M    68   2;1                     yes           no       yes
3         M    67   2;3                     no            no       no
4         M    65   6;0                     yes           yes      no
5         M    72   8;2                     yes           no       no
6         M    71   0;10                    yes           yes      no
7         M    64   14;1                    yes           no       no
8         F    58   2;2                     yes           yes      no
9         F    66   5;0                     yes           no       no
10        M    75   4;10                    yes           yes      no
11        M    80   4;6                     yes           yes      no
12        M    48   2;0                     yes           yes      yes
13        M    45   15;7                    yes           no       no
Table 5.1 Sociodemographic and clinical information of the TE speakers
loudness. The recorded stimuli consisted of sustained vowels, the phonetically balanced Rain-
bow Passage and a small set of descriptive sentences. A list of the recorded stimuli can be found
in Appendix A. Electroglottograph (EGG) and speech signals were recorded in a quiet room with
a laryngograph processor (Laryngograph Ltd.) and an external soundcard (Edirol UA25) di-
rectly into a laptop at a sampling frequency of 16kHz. The position of the electrodes was set by
the speech therapist for each TE patient. Unfortunately, the recorded TE EGG signals were too
noisy to be useful for glottal closure instant (GCI) extraction. As a result, these were extracted
manually for TE speech, in order to avoid the artifacts which occur when using automatic GCI
extraction techniques. Unfortunately, speech produced by patient 10 was too aperiodic even for
manual GCI extraction and thus, was excluded from the repair experiments. On the other hand,
GCIs of the recorded normal speech samples were obtained automatically from the correspond-
ing EGG signals.
The described speech database has been used for the development and testing of the TE re-
pair algorithms presented in this work. The parallel TE and normal corpora have been employed
to perform comparative analysis of acoustic glottal source features and the varying TE speech
qualities to thoroughly evaluate the different voice source repair methods. In addition, the 28
sentences available per speaker divided into training (23 utterances) and test (5 utterances) sets
of normal and TE speech, have been used to build, adapt and test the various decision trees and
speech recognition systems developed for duration repair.
5.4 Voice Source Repair
The acoustic consequences of the lack of accurate control over the vibration of the neoglottis are
differences in the glottal waveform itself, higher values of jitter, shimmer and aspiration noise
than normal and a lower F0, particularly in female speakers.
Studies exist which have attempted to tackle these limitations of TE speech. After demonstrating
that LP could actually be used to obtain vocal tract estimates from TE speech [74] and
analysing TE glottal waveforms obtained by inverse filtering [75], Qi and colleagues resynthesised
female TE words with a fixed synthetic LF derivative glottal wave and with smoothed and
raised F0 [76]. Source replacement was expected to diminish the perceptual effect produced by
the highly variable and deviant TE glottal waves, while the smoothing and raising of F0 were
expected to restore the female speakers' gender and personal characteristics. Results showed
that the replacement of the derivative glottal waveform and F0 smoothing alone produced the most
significant enhancement, while increasing the average F0 led to less dramatic improvement.
However, the described repair attempts present some limitations, particularly regarding evaluation.
First, their experimental assessment was limited to sustained vowels and words, even
though the use of continuous speech is clearly necessary to accurately evaluate perceptual improvements.
Second, only a small number of TE speakers and TE speech qualities was tested.
Third, the degree of enhancement was not quantified, i.e. the perceptual characteristics that
were improved were not determined, nor was the quality of the resynthesised speech analysed. Such
information would be useful in order to gain insight into the perceptual deviations that might
still need to be repaired.
The voice source repair methods explored in this work follow Qi et al.'s [76] approach, i.e.
they substitute TE voice source estimates with synthetic glottal waveforms, smooth fundamental
period and energy trajectories and raise the F0 of some speakers. However, they differ in the
employed source-filter deconvolution technique and the synthetic glottal waveforms used for
resynthesis. In addition, they reduce TE aspiration noise levels to normal values. Furthermore,
the above-mentioned limitations in their evaluation are also addressed. In this work, performance
is assessed on the continuous speech of twelve TE speakers with different qualities, and
the perceptual improvement achieved with the repair of each feature is also analysed, together
with the naturalness, rhythm and intelligibility of the repaired speech samples. The following
sections describe the techniques developed to repair the main deviant TE voice source features,
i.e. glottal waveforms, jitter and shimmer trajectories and low fundamental frequencies, and
their perceptual effects in more detail. Evaluation of their performance is then presented in
Section 5.6.
5.4.1 Glottal Replacement
Glottal replacement involves performing source-filter deconvolution and substituting the obtained
TE glottal source estimates with a synthetic glottal wave. Employing a source-filter deconvolution
technique capable of adequately separating the glottal source and spectral envelope
components from the speech waveform and using an appropriate synthetic glottal source for
replacement are thus important.
In Qi et al.'s repair experiment, deconvolution is done using pitch-asynchronous autocorrelation
LP, fixed pre-emphasis and a 40 ms frame size. Despite being a simple way of calculating
spectral envelope estimates, this method does not completely decouple the source and filter
components and thus does not provide as reliable glottal source waveforms as the other existing
deconvolution techniques introduced in Section 3.1.2. In this work, two alternative source-filter
deconvolution methods have been explored instead: Closed-Phase Covariance Linear Prediction
(CPCLP) and Adaptive Joint Estimation (AJE). Regarding source replacement, two possibilities
have also been investigated: using a fixed modal LF waveform as in Qi et al.'s approach and
employing the LF waves fitted to the obtained TE glottal source estimates. The combination
of these deconvolution and source replacement options has resulted in two glottal replacement
algorithms: CPCLP plus fixed LF replacement and JEAS replacement.
CPCLP plus fixed LF replacement
The initially developed baseline glottal replacement system was based on CPCLP. CPCLP was
chosen as a deconvolution technique because its formulation does not make any assumptions
about the shape of the voice source whilst still being capable of decoupling the source and filter
components rather accurately. In addition, if multi-cycle CPCLP is applied, glottal opening instant
detection can be avoided; instead, very short fixed closed-phase segments of consecutive pitch
periods can be used to obtain reliable vocal tract filter estimates.
Therefore, the baseline CPCLP glottal replacement algorithm employs multi-cycle CPCLP,
fixed pre-emphasis (0.97) and a frame size of three pitch periods for source-filter deconvolution.
The filter order is set to 18 for a sampling frequency of 16 kHz and the number of samples in
each fixed closed phase to twice the filter order, i.e. 36. Glottal replacement is then done by
simply substituting the derivative glottal waves, calculated by inverse filtering the speech signal
with the CPCLP vocal tract filter estimates, with a fixed LF waveform. The values proposed for
modal phonation in [16] are used to construct the synthetic LF wave.
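As an illustration of the inverse-filtering step just described, the sketch below filters a speech frame with the CPCLP inverse filter A(z). It is a minimal sketch, not the thesis implementation: the function name is illustrative, and pre-emphasis compensation and frame handling are omitted.

    import numpy as np
    from scipy.signal import lfilter

    def derivative_glottal_wave(frame, lpc):
        """Inverse-filter a speech frame with the CPCLP vocal tract estimate
        A(z) = 1 + a1*z^-1 + ... + ap*z^-p to obtain a derivative glottal wave
        estimate. `lpc` is the coefficient vector [1, a1, ..., ap], assumed to
        have been estimated from the fixed closed-phase samples of three
        consecutive pitch periods with fixed pre-emphasis, as described above."""
        return lfilter(np.asarray(lpc, dtype=float), [1.0], frame)

In the baseline system, the output of such an inverse-filtering step is discarded and replaced with the fixed modal LF waveform.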
However, TE speech resynthesised by this method contained some artifacts which were not
produced when the same processing was applied to normal speech. Because in both cases the
glottal source was fixed to a modal LF waveform, the artifacts could only be due to deviations
in the TE vocal tract filter estimates.
In order to quantify the major differences between the normal and the TE CPCLP spectral
envelopes, a comparative study was performed, where the sustained vowels /AE/, /EH/, /IY/,
/AA/, /AH/, /ER/, /IH/, /AO/, /UW/ and /UH/ as in the words bat, bet, beet, back, but, Bert,
bit, bought, boot and book produced by the 11 normal and the 13 TE speakers of the speech
corpus were investigated. This notation is the upper case version of the ARPAbet.
The following spectral envelope features were compared:
1. First, second and third formant gains (G1, G2, G3), frequencies (F1, F2, F3) and bandwidths
(BW1, BW2, BW3) and their relative differences G12, G23, G13, F12 and F23. Formant frequencies
and bandwidths were extracted from the roots of the LP coefficients, while gains
were computed from the values of the log spectral envelopes at the formant frequencies in
a linear scale.
2. The cepstral distance between consecutive envelopes S and S', defined in Equation 5.1 and
employed as a spectral distortion measure (a computational sketch is given after this list),

   d_c^2(S, S') = 2 \sum_{n=1}^{p} (c_n - c'_n)^2 ,   (5.1)

where p is the filter order and c_n and c'_n are cepstral coefficients obtained from the CPCLP
filter coefficients \alpha_1 ... \alpha_p through the following recursion

   c_n = \alpha_n + \frac{1}{n} \sum_{k=1}^{n-1} k \, c_k \, \alpha_{n-k}   for n > 0 ,   (5.2)

where \alpha_0 = 1 and \alpha_k = 0 for k > p.
3. The spectral tilt, measured as the slope of a 1st order linear regression of the log LP
spectrum.
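As a concrete reference for Equations 5.1 and 5.2, the following Python sketch computes the cepstral coefficients and the distortion measure from the CPCLP filter coefficients. It is only an illustration of the formulas above, not the thesis implementation; the function names and the choice of using p cepstral coefficients are assumptions.

    import numpy as np

    def lpc_to_cepstrum(alpha, n_ceps):
        """Convert LP coefficients [alpha_1, ..., alpha_p] to cepstral coefficients
        via the recursion c_n = alpha_n + (1/n) * sum_{k=1}^{n-1} k*c_k*alpha_{n-k},
        with alpha_0 = 1 and alpha_k = 0 for k > p (Equation 5.2)."""
        p = len(alpha)
        a = np.concatenate(([1.0], alpha, np.zeros(max(0, n_ceps - p))))
        c = np.zeros(n_ceps + 1)            # c[0] unused; indices 1..n_ceps
        for n in range(1, n_ceps + 1):
            acc = a[n]
            for k in range(1, n):
                acc += (k / n) * c[k] * a[n - k]
            c[n] = acc
        return c[1:]

    def cepstral_distance(alpha1, alpha2, n_ceps=18):
        """Spectral distortion between two CPCLP envelopes (Equation 5.1)."""
        c1 = lpc_to_cepstrum(alpha1, n_ceps)
        c2 = lpc_to_cepstrum(alpha2, n_ceps)
        return 2.0 * np.sum((c1 - c2) ** 2)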
The mean and standard deviation of these features were obtained for each speaker and
vowel. Then, a Student's t-test between normal and TE values was performed in order to find the
significant differences. The standard deviations of G1, F2 and BW2 were found to be significant
in 80% of the vowels (p < 0.05). Spectral distortion was also significant in all vowels (p < 0.003).
In addition, differences in G13 were found to be significant in 90% of the vowels (p < 0.05) and
the spectral tilt measures in all vowels (p < 0.0002). On the other hand, significant differences
between normal and TE mean formant frequencies were only found in 40%, 30% and 40% of
the vowels for F1, F2 and F3 respectively.
The previous results suggest that there are two main differences between normal and TE
CPCLP spectral envelopes:
a The higher standard deviation of formant gains, frequencies and bandwidths and spectral
distortion indicate that consecutive CPCLP estimates differ more in TE than in normal
speech.
b The relative gain G13 and spectral tilt differences indicate that TE spectral envelopes are
less tilted than the normal ones overall.
An enhancement algorithm has been developed to solve these two issues. In order to reduce
the higher CPCLP spectral envelope variation, a smoothing approach is employed. First, CPCLP
coefficients obtained during analysis are converted to line spectral frequencies (LSF). A 10th-order
median filter is then applied to the LSF trajectories of each voiced segment to smooth pitch-period
to pitch-period variations (see Figure 5.2a).
Figure 5.2 Enhancement examples: a) LSF smoothing: original (crosses) and smoothed (continuous) trajectories of the first four LSF coefficients; b) Tilt reduction: original (continuous) and reduced (dotted) spectral envelopes
Figure 5.3 Normal and TE JEAS derivative glottal wave estimates: a) estimated derivative glottal waveforms, b) estimated derivative glottal waveforms (IF dgw) and fitted LF waves (fitted dgw), c) aspiration noise estimates
Such a filter order was found to sufficiently smooth the TE LSF deviations while still maintaining
their variations caused by articulatory changes. Regarding the spectral tilt, a first-order low-pass filter
with a 4 kHz cut-off frequency and a -6 dB/octave roll-off is applied to each vocal tract estimate which,
by attenuating the higher frequencies, is capable of reducing the overall tilt of the spectral envelopes
to normal values (see Figure 5.2b).
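A minimal sketch of these two enhancement steps is given below. It assumes that LPC-to-LSF conversion helpers such as poly2lsf/lsf2poly are available (for instance from the third-party `spectrum` package; scipy does not provide them), and the mapping of the 4 kHz cut-off to a one-pole filter is a design choice of the sketch rather than the thesis implementation.

    import numpy as np
    from scipy.signal import medfilt, lfilter
    # LPC<->LSF converters assumed to come from a helper library, e.g. `spectrum`.
    from spectrum import poly2lsf, lsf2poly

    def smooth_lsf_trajectories(lpc_frames, kernel=11):
        """Median-filter the LSF trajectories of one voiced segment to reduce
        pitch-period to pitch-period envelope variation (cf. Figure 5.2a).
        `lpc_frames` is a list of LP polynomials [1, a1, ..., ap], one per pitch
        period. The thesis uses a 10th-order median filter; scipy's medfilt
        needs an odd kernel, so 11 is used in this sketch."""
        lsf = np.array([poly2lsf(a) for a in lpc_frames])      # (frames, p)
        smoothed = np.column_stack([medfilt(lsf[:, i], kernel)
                                    for i in range(lsf.shape[1])])
        return [lsf2poly(row) for row in smoothed]

    def reduce_tilt(vt_frame, fs=16000, fc=4000.0):
        """Attenuate the higher frequencies of a vocal-tract-filtered frame with
        a first-order low-pass (roughly -6 dB/octave above fc), pulling the
        overall spectral tilt towards normal values (cf. Figure 5.2b)."""
        r = np.exp(-2.0 * np.pi * fc / fs)
        return lfilter([1.0 - r], [1.0, -r], vt_frame)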
In addition, using a fixed aspiration noise source together with the replacement modal LF
waveform was found to add naturalness to the repaired outputs. The aspiration noise component
for each pitch period was generated following the aspiration noise model employed in
JEAS, i.e. modulating the amplitude of a Gaussian noise source of fixed variance with the modal
LF wave of that pitch period.
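The noise generation just described can be sketched as follows. This is only an illustration: modulating by the magnitude of the LF wave and the default variance value are assumptions, not details confirmed by the thesis.

    import numpy as np

    def aspiration_noise(lf_wave, variance=0.5, rng=None):
        """Generate an aspiration noise component for one pitch period following
        the JEAS noise model described above: Gaussian noise of fixed variance,
        amplitude-modulated by the modal LF wave of that period."""
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.normal(0.0, np.sqrt(variance), size=len(lf_wave))
        return noise * np.abs(np.asarray(lf_wave, dtype=float))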
JEAS replacement
Whilst LSF smoothing and tilt reduction have been found to be successful in reducing the CPCLP
TE spectral deviations, those deviations are mainly caused by inadequacies in the employed
deconvolution and source replacement approach. In particular, spectral tilt correction is required
due to the mismatch between the estimated TE derivative glottal sources and the synthetic LF
waveform employed for their replacement. The use of fixed pre-emphasis and a fixed modal
LF source assumes that the average tilt of the TE speech spectrum is fixed and comparable to
normal, while the comparative CPCLP spectral envelope study has shown that it is generally less
steep in practice. As a result, the fixed pre-emphasis approximation introduces modelling errors
in the slope of the TE spectral envelopes which result in perceptual artifacts when convolved
with the synthetic modal glottal waves.
A better decoupling and modelling of the TE voice source is expected to improve the performance
of the glottal replacement approach. In this sense, the AJE method developed for
JEAS modelling offers a more accurate framework for source-filter deconvolution, and LF fitting
provides synthetic source parameterisations which match the tilt of the estimated TE glottal
waveforms. In addition, the parameter trajectory smoothing of JEAS synthesis also helps reduce
the higher consecutive TE spectral envelope variations. For these reasons, the use of the JEAS
model for TE speech analysis, synthesis and repair has been explored.
Whilst the adequacy of employing the source models developed for normal speech to model
TE excitations might be questioned, the glottal source representations used by the JEAS framework,
i.e. the RK and LF models, have been found to be flexible enough to capture the deviations
of the TE derivative glottal waves. Typical TE and normal derivative glottal sources obtained using
JEAS modelling are shown in Figure 5.3. Their biggest difference lies in the much higher
amount of aspiration noise present in the TE voice source. On the other hand, the underlying
shape of the TE glottal source does not deviate much and can be captured by the LF fitting procedure.
In fact, when the analysed JEAS model parameters are used for synthesis, TE speech
can be resynthesised which is perceptually almost indistinguishable from the original.
Figure 5.4 Original TE speech
Figure 5.5 Repaired TE speech
Following the JEAS modelling approach, glottal replacement is achieved by substituting the
estimated TE glottal sources with the fitted LF waveforms. In addition, the variance of the
TE aspiration noise is reduced to normal levels. Utterances repaired using this technique have
fewer artifacts and sound more natural than those repaired with the CPCLP plus fixed LF source
replacement method. The more accurate source-filter decoupling, the use of the fitted LF waveforms
instead of a fixed modal source and the addition of an aspiration noise component are
thought to be the main causes of the quality improvement. Examples of original TE and JEAS
replaced utterances are illustrated in Figures 5.4 and 5.5 respectively.
5.4.2 Jitter and Shimmer Reduction
Jitter and shimmer are two source-related acoustic measures which are also higher than normal
in TE speech [80, 11]. Jitter can be defined as the short-time perturbation of the pitch contour.
Shimmer refers to the period-to-period variability of the average pitch period energy. Whilst
a small amount of jitter and shimmer is natural in normal speech, deviation is greater in
TE speech mainly because TE speakers cannot accurately control the vibration pattern of the
neoglottis.
Reduction of jitter and shimmer involves modifying the utterance fundamental period and
energy contours. The adopted algorithm follows a smoothing approach based on median filtering.
The period-to-period variation of both the fundamental period and the energy contours of each
voiced segment is reduced by applying 25th- and 5th-order median filters respectively. These
values were experimentally found to reduce the perceptual effects of TE jitter and shimmer deviations
best overall. The differences between the original and smoothed contours constitute a
suitable approximation of jitter and shimmer. One way of reducing their values is to scale their
standard deviations downward. Whilst different jitter and shimmer reduction ratios have been
investigated for different TE speakers, they have not been found to make a big perceptual difference.
Thus, the final implementation simply employs the smoothed fundamental frequency and
energy contours directly. Figure 5.6 shows a jitter and shimmer reduction example.
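A minimal sketch of this smoothing step, using scipy's median filter on the per-period contours with the kernel sizes mentioned above:

    import numpy as np
    from scipy.signal import medfilt

    def reduce_jitter_shimmer(t0_contour, energy_contour):
        """Smooth the fundamental period (T0) and per-period energy contours of
        a voiced segment with 25th- and 5th-order median filters respectively.
        The smoothed contours are used directly for resynthesis; the residual
        (original minus smoothed) approximates the jitter and shimmer components."""
        t0_smooth = medfilt(np.asarray(t0_contour, dtype=float), kernel_size=25)
        e_smooth = medfilt(np.asarray(energy_contour, dtype=float), kernel_size=5)
        return t0_smooth, e_smooth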
5.4.3 F0 Raise
The average fundamental frequency of TE speech has often been cited as being lower than normal,
particularly that of female speakers [80, 93, 97], making TE speech sound deep and
low. Raising the F0 of TE speakers to an average normal value is thus expected to improve its
quality.
In order to find out which TE speakers of the speech corpus have an F0 significantly lower
than normal, the average fundamental frequency of each TE speaker in one of the recorded
utterances was calculated and compared to the mean normal male and female values. Tables
5.2 and 5.3 show the obtained results.
Three of the TE male speakers, patients 2, 6 and 11, were found to have mean F0s lower
than the average normal values by 75, 48 and 28 Hz respectively, which results in significant
perceptual differences.
Figure 5.6 Fundamental Period and Energy Contour Smoothing
The rest of the speakers matched the average normal male fundamental
frequency overall. Regarding the female TE speakers 8 and 9, the F0 differences were found to
be larger, with fundamental frequencies around 80 Hz lower than the average normal female value
in both cases.
When repairing TE speech, only the F0 of TE speakers 2, 6, 11, 8 and 9 is raised to match
the average normal values, while the original values are kept for the rest.
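As an illustration of the F0 raising step, the sketch below shifts a speaker's voiced F0 contour so that its mean matches the average normal value. Whether an additive shift or a multiplicative scaling was applied is not stated in the text, so the additive mapping here is an assumption.

    import numpy as np

    def raise_f0(f0_contour, target_mean_f0):
        """Raise a TE speaker's F0 contour so that its voiced mean matches the
        average normal value (e.g. 118 Hz for males, 184.4 Hz for females, as
        in Tables 5.2 and 5.3), preserving the shape of the original contour.
        Unvoiced frames are assumed to be marked with F0 = 0."""
        f0 = np.asarray(f0_contour, dtype=float)
        voiced = f0 > 0
        out = f0.copy()
        out[voiced] += target_mean_f0 - f0[voiced].mean()
        return out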
5.4.4 Perceptual effect
Due to the lack of appropriate standard objective measures, perceptual scales have been commonly
used as a gold standard to evaluate the varying qualities of disordered speech. Among
the ranking scales used for the assessment of TE speech quality, deviant-normal, ugly-beautiful
and unpleasant-pleasant have been employed for quality evaluation, while scales such as noisy-not noisy,
bubbly-not bubbly, breathy-not breathy, rough-not rough, creaky-not creaky or low-high
have been used to describe and characterise different TE speech qualities [97]. The described
voice source repair algorithms have been found to alter the perception of some of these scales,
bringing them closer to normal.
Informal listening to glottal replaced utterances revealed that they were, in general, less noisy and
breathy than the original ones. The greater amounts of noise and breathiness present
in TE speech are thought to be due to aspiration noise and incomplete closure of the neoglottis,
which results in air leakage during the closed phase of the glottal source. Because during glottal
replacement the TE glottal sources are substituted with a synthetic modal excitation having zero
flow during the closed phase and a reduced aspiration noise component, the perceived noise
and breathiness are diminished as a result.
Jitter and shimmer reduction were found to decrease roughness and creakiness. This is
consistent with studies that have related roughness and creakiness with cycle to cycle variations
Male TE Speaker    Mean F0 [Hz]
1                  125.7
2                   43.2
3                  131.6
4                  142.2
5                  109.5
6                   70.0
7                   98.2
11                  89.6
12                 125.8
13                 109.0
Average Normal Male: 118
Table 5.2 Average Male TE and Normal F0 [Hz]

Female TE Speaker  Mean F0 [Hz]
8                  106.0
9                  102.5
Average Normal Female: 184.4
Table 5.3 Average Female TE and Normal F0 [Hz]
of the fundamental frequency and the period amplitude [59, 100]. As expected, the raising of
the average fundamental frequency of TE speakers with very low F0 caused them to be perceived as
higher and more natural overall, particularly in the case of female TE speech.
TE VOICE SOURCE REPAIR ALGORITHM
Input: TE speech
Output: voice source repaired TE speech
1   foreach voiced frame do
      // Glottal Replacement
2     if method is CPCLP plus fixed LF replacement
        // obtain vocal tract filter coefficients
3       {alpha_1, alpha_2, ..., alpha_p} <- fixed pre-emphasis, CPCLP
        // generate a fixed modal LF wave
4       lf(n) <- {T_p = 0.45, T_e = 0.6, T_a = 0.005, T_c = 0.65}
        // generate a fixed aspiration noise source
        an(n) <- ANE = 0.5
        // apply LSF smoothing and tilt reduction
5       {alpha'_1, alpha'_2, ..., alpha'_p} <- LSF Smoothing, Tilt Reduction
6     elseif method is JEAS replacement
        // obtain vocal tract filter coefficients
7       {alpha_1, alpha_2, ..., alpha_p} <- Adaptive Joint Estimation (AJE)
        // fit LF waveform
8       lf(n) <- {T_p, T_e, T_a, T_c}
        // reduce the variance of the aspiration noise estimate to normal (N)
9       an(n) <- ANE = ANE_N
      end
      // Jitter and Shimmer Reduction
10    T'_0, E' <- Median Filtering
      // Raise mean F0 to the average normal (N)
11    mean F'_0 = mean F_0^N
    end
Summary of the TE Voice Source Repair Algorithm
5.5 Duration Repair
(An adapted version of this section has been previously presented in [22].)
The main durational limitations of TE speech are shorter maximum phonation times, longer
vowel durations and slower speaking rates. In addition, TE speakers generally pause longer and more
often and sometimes rush the last phones before breaks when they are running out of air. Whilst
these deviations are well known, there have been no previous attempts to repair the TE speech
duration pattern.
Possible approaches to repair the mentioned TE duration limitations are
a  to derive a set of rules to modify the duration features found to be abnormal (i.e. reduce
   vowel durations and pauses, increase the speech rate and the durations of phones before
   breaks)
b  to substitute TE phone durations with their corresponding normal values.
Preliminary experiments were carried out to evaluate the performance of these two approaches.
The difficulty with the rule-based method lies in obtaining adequate reduction/increase
rules and ratios, which generally differ per speaker and per sentence. Experiments with this technique
resulted in unnatural duration contours which, despite normalising the deviant duration
features, nevertheless ruined sentence rhythm. On the other hand, transplantation of average
normal phone duration contours obtained from the parallel corpus achieved better results. Informal
listening to transplanted utterances showed an overall improvement, which increased
the naturalness of the original TE samples. This method not only coped with the observed TE
durational problems but also preserved the rhythmic structure of the sentences.
The developed TE speech duration repair algorithm is an automation of the preliminary
transplantation experiment, which applies the duration transformation technique explored for
voice conversion in the previous chapter (see Section 4.4). Because the method presumes that
TE phone segmentations and normal phone durations are known, its application in a real-time
repair system where the transcription of the input speech is unknown requires the TE phones
to be recognised and their normal durations to be predicted. In addition, methods are needed
to provide robustness to possible recognition and prediction errors. The adopted recognition,
duration prediction and robust modification techniques are described in the next sections.
5.5.1 Tracheoesophageal phone recognition
Speech recognition systems assume that the speech signal represents a message encoded as a
sequence of symbols. To be able to recognise the underlying symbol sequence, they convert the
continuous speech waveform into a sequence of discrete acoustic parameter vectors. Recogni-
tion then involves mapping between the sequences of speech vectors and the underlying symbol
sequences. This is generally done using Hidden Markov Models (HMMs). The fact that different
underlying symbols can give rise to similar speech sounds and that a particular symbol realisation
can have large variations due to speaker variability, mood and environment make speech
recognition a difficult problem.
A variety of HMM symbols can be used, e.g. words, phones, syllables. While words are
appropriate for simple isolated word recognition tasks, they are not scalable to large vocabulary
continuous speech recognition. In those cases phones are mostly employed, together with
dictionaries which map phone sequences to words. The use of acoustic HMM models taking
context information into account has been shown to improve recognition performance. Thus,
apart from context-independent monophones, biphones taking left or right phone context and triphones
taking left and right phone context into account are also employed. In addition, language models
can improve recognition further by constraining the concatenation of the acoustic HMM model
sequences. N-gram language models are the most commonly used today. They estimate the
probability of a symbol given the preceding N-1 symbols from large collections of transcribed text.
In order to keep these estimates tractable, N is generally constrained to two (bigram) or three
(trigram).
In order to deal with the speech variability problem, several techniques can be employed.
Speech recorded in different environments can be normalised using a technique called Cepstral
Mean Normalisation (CMN), which subtracts the cepstral mean from all input vectors, removing the
effect of the channel transfer function. Regarding speaker variability, the performance of speaker-independent
HMMs trained on large amounts of data from different speakers can be improved
by customising them to the characteristics of the particular speaker to be recognised. This can
be achieved using linear transformation adaptation techniques such as Constrained Maximum
Likelihood Linear Regression (CMLLR), which requires only a small amount of adaptation data.
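The channel normalisation step mentioned above amounts to a per-utterance mean subtraction; a minimal sketch (the function name and data layout are illustrative):

    import numpy as np

    def cepstral_mean_normalisation(cepstra):
        """Cepstral Mean Normalisation: subtract the utterance-level mean from
        every cepstral vector, removing the (approximately constant)
        contribution of the channel transfer function. `cepstra` is an array
        of shape (num_frames, num_coefficients)."""
        cepstra = np.asarray(cepstra, dtype=float)
        return cepstra - cepstra.mean(axis=0, keepdims=True)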
Previous work on automatic TE speech recognition by Haderlein et al. [42] involved adapt-
ing a speech recogniser trained on normal speech to single TE speakers by unsupervised HMM
interpolation. They obtained poor results in terms of word accuracy, with an average value of
36.4 %. Our preliminary word level recognition tests on TE speech also showed that extracting
usable orthographic transcriptions was not feasible. Hence, the focus has been on obtaining the
best possible TE phone recognition.
Various systems and techniques have been explored in order to achieve best results. The
baseline recogniser is a monophone system trained on the WSJCAM0 corpus [1] of normal non-
pathological speech. In addition, normalization and adaptation techniques and several acoustic
and language models have been tested. In order to measure and compare performance of the
different systems, two new metrics which not only measure recognition and segmentation accu-
racy but also take duration prediction errors into account have been used.
Measuring Performance
The performance of speech recognisers is generally measured by comparing the output string
with a transcribed reference. The reference transcription can be obtained either manually or by
force-alignment, i.e. by using the word-level transcript of the sentence to constrain an optimal
alignment between the HMM models and the spoken utterance. Then, the percentage of cor-
rectly recognised, substituted, inserted and deleted labels is simply counted. These measures
only take recognition of the correct labels into account, ignoring segmentation accuracy or the
implications of errors in a duration modication task.
Automatically derived transcriptions can also be regarded as consisting of a set of correctly
recognised and segmented sections with error segments in between, in which phones have been
wrongly segmented and/or misrecognised (see Figure 5.7). Differences between the durations
predicted within these error segments and their correct counterparts are the cause of the per-
ceptual artifacts produced when inaccurate phone label transcriptions are used instead of force-
aligned segmentations based on accurate reference transcriptions.
In the context of speech repair, the recogniser not only needs to recognise the correct phones,
but also accurately detect their boundaries. In addition, it should try to minimise the duration
prediction differences within the error segments. The following two measures which take these
requirements into account have been used to evaluate and compare the different recognisers:
Segmentation and Prediction Correctness (SPC): measures the percentage of phones
which have been correctly recognised, with segmentation boundaries lying within a threshold
distance of the reference values. An ideal recogniser would correctly recognise and
segment all phones, and thus have an SPC of 1.

   SPC = \frac{\sum_{i=1}^{NP} r(i) \, s(i)}{NP} ,   (5.3)

where r and s are boolean variables equal to one if a particular phone has been correctly
recognised or segmented respectively, and NP is the total number of phones in a sentence.
Segmentation and Prediction Error (SPE): sums the differences in duration prediction of
the error segments with respect to the reference values throughout the utterance and normalises
the result by the total number of phones in the sentence.

   SPE = \frac{\sum_{s=1}^{ES} \left| D(s) - \sum_{i=1}^{ESP} d(i) \right|}{NP} ,   (5.4)

where ES is the number of error segments in the sentence, ESP is the number of recognised
phones in a particular error segment, D is the duration predicted by the reference
transcription in a segment and d is the duration prediction of a recognised phone. The best
recogniser will be the one which achieves the maximum SPC and minimum SPE values.
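Both metrics can be computed once the recognised output has been aligned against the reference segmentation. The sketch below assumes that this alignment has already been carried out externally and only illustrates the final scoring of Equations 5.3 and 5.4; the data layout and function names are assumptions.

    def spc(aligned_phones):
        """Segmentation and Prediction Correctness (Equation 5.3).
        `aligned_phones` has one entry per reference phone, each a pair of
        booleans (label_correct, boundaries_within_threshold) produced by an
        external alignment step."""
        hits = sum(1 for ok_label, ok_seg in aligned_phones if ok_label and ok_seg)
        return hits / len(aligned_phones)

    def spe(error_segments, num_phones):
        """Segmentation and Prediction Error (Equation 5.4).
        `error_segments` has one entry per error segment, each a pair
        (reference_duration, predicted_durations_of_recognised_phones);
        `num_phones` is NP, the total number of phones in the sentence."""
        total = sum(abs(ref_dur - sum(pred_durs))
                    for ref_dur, pred_durs in error_segments)
        return total / num_phones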
Figure 5.7 Force-Aligned (FA) and Recognised (REC) Segmentations and Labels (continuous if correct, discontinuous if wrong)
System comparison
In order to improve recognition of the baseline monophone system, the following techniques
were explored. Firstly, cepstral mean normalisation (CMN) was used to normalise the recording
conditions of the training and test data sets. Also, as in [42], adaptation techniques were applied
to compensate for the acoustic differences between normal and TE speech. The TE training data
was used to adapt the baseline models to each of the TE speakers. Due to the limited amount of
data (just 23 sentences per speaker) available for adaptation, constrained maximum likelihood
linear regression (CMLLR) was used. Linear transformations were obtained after 3 iterations,
each using two regression classes for phones and silences/short pauses. These give the R1 system
listed in Table 5.4. Secondly, phone level bigram and trigram language models (LM) trained on
the WSJCAM0 corpus were introduced to give R2 and R3 in Table 5.4, respectively. Finally, a
system based on triphone HMMs and the word trigram LM described in [51] was tested. These
triphone models were trained on a very large corpus and then adapted to each TE speaker using
CMLLR. As well as being better trained than in the other systems, the use of a word level LM has
the potential to provide better phonotactic constraints than a phone level LM. This final system
is R4 in Table 5.4.
SYSTEM   FEATURES
BL       Baseline monophone HMMs
R1       BL + CMN + CMLLR
R2       R1 + bigram LM
R3       R1 + trigram LM
R4       triphone HMMs + CMLLR + word trigram LM
Table 5.4 The recognisers tested and their corresponding features
The overall recognition performance of the different systems on the TE test set was compared.
As shown in Table 5.5, normalization and adaptation almost doubled the baseline performance.
The addition of bigram and trigram LMs further improved the SPC and SPE results.
However, R4 achieved the biggest improvement and the best overall performance. These results
show the value of using refined acoustic and language models to compensate for the small
amount of TE data available for adaptation.
            BL      R1      R2      R3      R4
SPC [%]     16.3    31.3    32.5    33.4    51.5
SPE [ms]    39.44   29.71   27.33   26.70   14.26
Table 5.5 Evaluation of Recogniser Performance
5.5.2 Duration Prediction
CART duration models consider features which can be extracted from text. Unfortunately, as
noted earlier, the high word error rates of the TE word-level transcriptions prohibit their use in
a practical duration repair system. As a result, only recognisable phone-level information such
as phone identity, the identities of the previous and next phones and the position of the phone in the
sentence is available. In addition to these contextual and positional factors, the use of pitch and
RMS energy features has also been explored, in an attempt to incorporate some kind of stress
information.
Different combinations of the available features were used to build five regression trees (T1,
T2, T3, T4 and T5) and investigate phone-level feature relevance for duration prediction. The
trees were built using the Matlab implementation of CART [65]. Short pauses (SP) were not
regarded as phones and were modelled independently in a parallel tree TSP. Table 5.6 provides
a more detailed description of the different tree features. The normal speakers' training set was
used as training data. Phone segmentation was achieved by force-aligning each sentence with
the baseline recogniser BL. Speaker-adapted versions of this model were used for the segmentation
of TE speech.
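For illustration, a tree in the spirit of T3 can be built with any CART implementation. The sketch below uses scikit-learn rather than the Matlab CART toolbox used in the thesis, and the feature values and training rows are placeholders, not data from the corpus.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeRegressor

    # Each row: [phone identity, previous broad class, next broad class, position].
    X_train = [["ae", "stop",  "fricative", "rest"],
               ["iy", "nasal", "stop",      "first5"],
               ["t",  "vowel", "vowel",     "last5"]]
    y_train = [0.110, 0.095, 0.060]   # phone durations in seconds (placeholders)

    # One-hot encode the categorical features and grow a regression tree on them.
    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        DecisionTreeRegressor(min_samples_leaf=5),
    )
    model.fit(X_train, y_train)

    # Predict a normal-speaker duration for a recognised phone in context.
    predicted_duration = model.predict([["ae", "stop", "fricative", "rest"]])[0]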
Tree performance was evaluated against the TE test set, computing the average mean squared
error (MSE) between the mean normal durations used for transplantation and the predicted values.
Results showed that T3, followed by T2 and T1, predicted phone durations closest to the
transplanted values, revealing that phone context and positional information improve duration
prediction (see Table 5.7). However, differences between them were not large and phone identity
appeared to be the most relevant feature in the three cases. On the other hand, the addition
of pitch and energy features decreased performance, showing that linear regression of phone
pitch and intensity contours does not appropriately model lexical stress as we had hoped.
TREE  FEATURES
T1    F1  phone identity
T2    F2  F1 + previous and next phone identities (converted to broad classes)
T3    F3  F2 + position of phone in sentence (first 5 phones / last 5 phones / rest)
T4    F4  F3 + pitch (positive slope / no slope / negative slope)
T5    F5  F4 + energy (positive slope / no slope / negative slope)
TSP   FS  number of phones since previous sp, number of phones until next sp
Table 5.6 Description of trees and corresponding features
            T1      T2      T3      T4      T5
MSE [ms]    0.788   0.695   0.570   2.174   1.535
Table 5.7 Evaluation of tree duration prediction
TE sentences whose force-aligned phone durations were substituted by those predicted by
T3 and TSP were informally found to be perceptually indistinguishable from their corresponding
transplanted versions, demonstrating the validity of the adopted duration prediction approach.
However, even when the best recognised segmentations from R4 were used instead of the force-
aligned labels, phone recognition errors caused durational artifacts, emphasizing the need for a
robust modication method capable of taking recognition errors into account.
5.5.3 Robust duration modification
One way to reduce the duration artifacts caused by recognition errors is to incorporate phone
recognition confidence information in the repair process, and to modify durations accordingly.
Such a method can be described by the following equation

   d_N = \beta \, d_P + (1 - \beta) \, d_O ,   (5.5)

where d_N is the new duration, d_P is the predicted duration, d_O is the original duration and \beta is
a confidence measure.
The main difficulty with this technique lies in obtaining appropriate values of \beta. TE phone
duration probability distributions and confidence scores can be used to compute the confidence
measure. In addition, information on phone confusions can also be incorporated from phone-level
confusion networks. An analysis of the correlation between recognition errors and these
features revealed that high confidence scores and duration probabilities corresponded to correctly
segmented and labelled phones, while low duration probabilities or low confidence scores
generally coincided with insertions, deletions and substitutions. Also, the correct phone was
often included in the phone confusion lists. As a result, \beta was computed as the mean of the
normalised duration probability density and confidence score for each phone, and d_P was calculated
as the average of the durations predicted for the confused phones. The described robust
modification (RM) technique was used to modify the durations of the phones recognised by R3
and R4. These systems will be referred to as RM1 and RM2 respectively for comparison purposes
(see Table 5.8).
SYSTEM  FEATURES
RM1     R3 + RM
RM2     R4 + RM
Table 5.8 Robust modification systems
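A minimal sketch of the confidence-weighted modification of Equation 5.5, with variable names chosen for illustration only:

    import numpy as np

    def robust_duration(d_predicted, d_original, duration_prob, confidence,
                        confused_predictions=None):
        """Confidence-based duration modification (Equation 5.5):
            d_new = beta * d_P + (1 - beta) * d_O,
        where beta is the mean of the normalised duration probability and the
        recogniser confidence score for the phone, and d_P is averaged over the
        durations predicted for the phones in its confusion list when one is
        available."""
        beta = 0.5 * (duration_prob + confidence)
        if confused_predictions:
            d_predicted = float(np.mean(confused_predictions))
        return beta * d_predicted + (1.0 - beta) * d_original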
The application of this confidence-based smoothing technique considerably reduced durational
artifacts. Even though converted utterances did not perceptually match the transplanted
versions, informal listening showed that the main TE duration deviations previously described
were mostly repaired without additional artifacts, resulting in more natural duration contours
overall. Also, sentences modified with RM1 and RM2 were found to be perceptually almost
indistinguishable.
The performance of the different repair systems has been tested by computing the MSE between
the repaired phone durations and the transplantation values. Results in Table 5.9 show that the
proposed repair technique reduces the MSE overall, bringing TE duration contours closer to
those of the average normal speaker. In addition, the application of robust duration modification
further improves the results.
SYSTEM               MSE [ms]
original TE speech   10.080
R3                    5.873
R4                    3.913
RM1                   3.994
RM2                   3.186
Table 5.9 Evaluation of Repair Systems
TE DURATION REPAIR ALGORITHM
Input: normal training data, TE adaptation data, TE test speech
Output: duration repaired TE test speech
    // Train CART duration tree
1   T3 <- normal training data
    // Adapt recogniser to TE speaker
2   R3, R4 <- CMLLR, TE adaptation data
    // Duration Repair
3   foreach TE test sentence do
      // recognise sentence
4     {ph_1, ph_2, ..., ph_NP} <- R3, R4
      // predict normal phone durations
5     {d_P^1, d_P^2, ..., d_P^NP} <- T3, {ph_1, ph_2, ..., ph_NP}
      // apply Robust Duration Modification
6     d_N = \beta d_P + (1 - \beta) d_O
    end
Summary of the TE Duration Repair Algorithm
5.6 Evaluation
As in VC applications, while objective evaluations are useful to measure and compare the effectiveness
of the different repair algorithms, subjective perceptual evaluations are still required to
assess whether the developed repair methods are actually capable of achieving the desired improvements.
Therefore, a listening test was carried out in order to evaluate the implemented voice
source and duration repair algorithms. The test was designed to assess
1. Which glottal replacement method, CPCLP plus fixed LF replacement or JEAS replacement,
produces more natural sounding repaired speech.
2. Whether the quality of the voice source repaired utterances is perceived to be less deviant than
the original TE speech samples.
3. Whether the developed duration repair algorithm is capable of generating more normal rhythmic
patterns than the original TE ones.
4. The quality of the voice source plus duration repaired TE speech in terms of naturalness,
intelligibility and rhythm.
The perceptual test consisted of five sections. Because the 30 subjects who participated in the
perceptual study were naive listeners, the first section consisted of a training phase to familiarise
them with TE speech and the concepts of speech quality and rhythm. Subjects were presented
with TE speech samples with easily distinguishable quality or rhythm patterns and asked to
choose the ones they thought had a less deviant, ugly and unpleasant quality or a more normal
rhythm. The responses given during the training task were discarded. The rest of the sections
aimed at addressing each of the enumerated issues.
Five sentences produced by the twelve TE speakers were repaired with the described voice
source and duration repair techniques and used for evaluation. So that the length of the study
was not compromised, three test versions containing different subsamples of the evaluation set
were administered to three subject subgroups. Every test version presented the subjects with all
the stimuli of four TE speakers required for the evaluation of each section. Both the identities
of the speakers per section and the order of the samples were randomised in each version. As
a result, every subject evaluated different attributes of the twelve TE speakers. This design
facilitated the consistency of the subject responses per TE speaker while also exposing them to
the whole range of qualities, serving as a reference for the quality ranking section. Overall, each
tested stimulus was assessed by a total of 10 listeners.
The perceptual test was administered in a quiet room and over headphones. Subjects could
replay the stimuli as many times as they needed before making a decision. They were also
allowed to take short breaks whenever they considered it necessary. Details of the different
sections of the perceptual study and the obtained results are described next.
Figure 5.8 CPCLP vs. JEAS Glottal Replacement Evaluation (percentage preference per TE speaker; legend: CPCLP, JEAS)
5.6.1 CPCLP vs. JEAS glottal replacement
In order to compare the performance of the CPCLP plus fixed LF and JEAS replacement methods,
two sentences per speaker were repaired with each approach. The enhanced CPCLP system
was employed in the comparison, i.e. LSF smoothing, tilt reduction and a fixed aspiration noise
source were applied. In addition, in both methods jitter and shimmer were also reduced and the
F0 of speakers 2, 6, 11, 8 and 9 was modified to match average normal values.
Subjects were presented with CPCLP and JEAS repaired utterance pairs and asked to choose
the one they found more natural. Figure 5.8 shows the distribution of the results for each
TE speaker. As can be observed, there was a clear preference for the JEAS method overall,
which was chosen instead of CPCLP glottal replacement 77% of the time on average. Results
varied slightly per speaker, ranging from 95% for speaker 6 to 60% for speaker 9.
Perceptual evaluation thus confirmed that accurate source-filter deconvolution and the use
of LF waveforms which match the TE glottal tilt result in more natural repaired speech than
simply employing a fixed synthetic glottal source.
5.6.2 Evaluating Voice Source Repair
Whilst voice source repair was informally found to alter some of the perceptual qualities characteristic
of TE speech, i.e. repaired speech sounded less noisy, breathy, rough, creaky and low
in general, the question of whether the quality of the repaired samples was perceived to be less
deviant still needed to be answered.
In order to evaluate the performance of the developed voice source repair approach, five
sentences per TE speaker were repaired using the JEAS glottal replacement, jitter and shimmer
reduction and F0 raising techniques. These were then paired against corresponding JEAS
analysed-resynthesised utterances. Analysed-resynthesised versions of the TE samples were employed
instead of the original ones in order to rule out any bias towards unprocessed speech.
Figure 5.9 Voice Source Repair Evaluation (percentage preference per TE speaker; legend: AS, Voice Source Repaired)
Subjects were then asked to listen to each speech pair and choose the one they found less
deviant, ugly and unpleasant to listen to.
As shown in Figure 5.9, repaired speech was preferred 76% of the time on average. Again,
differences were found among speakers, and preferences ranged from 60% for speakers 6 and 11
to 88% for speakers 1, 5 and 13.
In order to better analyse the causes of the speaker-varying voice source repair results, TE
speakers were classified into five different groups according to the original values of the repaired
features. The average value and variability of the fundamental frequency, spectral stability and
amount of aspiration noise were employed as classification criteria. The resulting groups were
as follows:
Group 1: comprised the most proficient TE speakers 1, 5 and 13, who present a stable F0
whose mean value is within normal ranges, smooth spectral variations and little aspiration
noise (for TE standards).
Group 2: speech samples in this group also have a quite stable and normal fundamental
frequency and a small amount of aspiration noise. However, they are characterised by higher
cycle-to-cycle spectral variations and a bubbly or wet quality, mainly thought to be caused
by mucus and sputum interfering with speech production. Speakers 3 and 12 belong to
this group.
Group 3: formed by speakers 2, 6 and 11, this group presents a very low and more variable
F0 but is otherwise spectrally stable and does not have too much aspiration noise.
Group 4: both female TE speakers 8 and 9 have been grouped together because of their
low average F0 compared to normal female speakers. However, apart from this fact, the
characteristics of this group would be the same as those of Group 1.
Group 5: is composed of those speakers, patients 4 and 7, with a higher amount of aspiration
noise but otherwise normal and stable average fundamental frequency and spectral characteristics.
One-way analysis of variance (ANOVA) was then used to investigate the relations between
the five TE speaker groups and the extent of the achieved perceptual improvement. The correlation
between the success of repair and the speaker groups was found to be significant with a p-value
of 0.012. Best results were obtained for Group 1, followed by Groups 4, 2, 5 and 3. These results
indicate that, in general, the more proficient speakers, both male and female, benefit most from
voice source repair. In contrast, the improvement achieved for TE speakers with more deviant
source features is more modest. In particular, the preference for the repaired utterances drops
to 60% for TE speech with very low and variable F0. This is thought to be due to the synthetic
quality introduced by the more extreme modifications, which is sometimes not preferred over
the original samples.
5.6.3 Evaluating Duration Repair
The objective measurements described in Section 5.5 revealed that the proposed duration repair
algorithm is capable of bringing the TE duration contours closer to the average normal patterns.
The smallest mean squared errors are obtained when recognisers adapted to each TE speaker and
constrained by a language model are used to recognise and segment utterances into phone
sequences, decision trees taking phone identity, context and positional features into account are
employed and a robust modification approach is applied to deal with possible recognition and
segmentation errors.
In order to evaluate the perceptual effect of the developed algorithm, the duration pattern
of five sentences per TE speaker was transformed using system RM1 and decision tree T3 (see
Section 5.5). Subjects were presented with pairs of voice source repaired sentences having the
original and the repaired duration contours. They were then asked to listen to each speech pair
and choose the one they found had a more normal rhythm. Overall, subjects found it quite
difficult to distinguish between the different rhythmic patterns.
The distribution of the results is shown in Figure 5.10. As can be seen, the preference for the
repaired samples varied not only per speaker, but also per sentence. Only for speaker 5 were
the duration repaired versions consistently chosen, 68% of the time on average but up to 80%
in some sentences. Speakers 7 and 11 present dichotomous distributions, where some repaired
sentences are preferred 80% and 70% of the time respectively while others are not chosen very
often. In other cases, speakers 1, 3 and 8 for example, repeated percentages of 50% demonstrate
that subjects did not have a clear preference for either duration pattern. The original utterances
were preferred for the rest of the speakers overall.
The variability of the duration repair results shows that the proposed method is only successful
in some cases. Analysis of the rhythmic pattern of speaker 5 revealed that it was consistently
slow and with long pauses between phrases. In addition, this is one of the most proficient TE
speakers, classified into Group 1 in the previous section, which probably contributed to better
Figure 5.10 Duration Repair Evaluation (percentage preference per TE speaker for sentences 23, 25, 28, 33 and 47)
phone recognition and segmentation and as a result, did not present important durational arti-
facts. On the other hand, the duration patterns of those sentences in which the original samples
were preferred were generally found to be not too deviant. The durational artifacts caused by
recognition and segmentation errors probably favoured the preference towards the original ut-
terances in these cases. Therefore, the amount of deviancy of the original sentences and the
acoustic characteristics of the TE speakers are thought to influence the outcome of the duration
repair algorithm. In this sense, an improvement of the recognition step is expected to yield
further improvements.
5.6.4 Ranking Quality
In the last section of the perceptual test, subjects were asked to judge the naturalness, intelligi-
bility and rhythm of voice source and duration repaired TE speech. This was done in order to
assess the overall impression naive listeners have of the output of the developed repair system.
Two voice source and duration repaired sentences were pasted together for each speaker and
used as stimuli. Subjects were then asked to rate the following properties:
Naturalness: how artificial/normal does the speech sound? (1-very artificial, 2-artificial,
3-quite normal, 4-normal, 5-very normal)
Intelligibility: how difficult/easy is it to understand? (1-very difficult, 2-difficult, 3-quite
easy, 4-easy, 5-very easy)
Rhythm: how unnatural/natural is its rhythmic pattern? (1-very unnatural, 2-unnatural,
3-quite natural, 4-natural, 5-very natural)
The distribution of the average mean opinion scores (MOS) per TE speaker is shown in Figure
5.11. Average naturalness, intelligibility and rhythm scores are 2.3, 2.6 and 2.6 respectively.
Figure 5.11 Mean Opinion Naturalness, Intelligibility and Rhythm Scores of repaired TE speech (MOS, 1-5, per TE speaker)
But again, differences exist among speakers. Scores go up to 3.70, 3.70 and 3.60 for one of the
most proficient TE speakers, patient 1. On the other hand, the intelligibility of speaker 6 is
ranked as very difficult to understand, i.e. 1.10.
5.7 Discussion
The TE speech repair approach presented in this chapter has attempted to repair the two main
limitations of TE speech, i.e. voice source and duration.
The developed voice source repair algorithm modifies the acoustic voice source characteris-
tics found to deviate in TE speech, bringing them closer to the average normal values. The shape
of the glottal waveform, jitter and shimmer, fundamental frequency and the amount of aspiration
noise are the features for which a repair has been attempted. Two different glottal replacement
methods have been explored, i.e. CPCLP plus fixed LF replacement and JEAS replacement, of
which JEAS replacement has been found to produce more natural sounding repaired speech.
This has also been found to be perceived as less deviant, ugly and unpleasant than the original
TE speech overall. However, the performance of the method depends on the quality of the orig-
inal TE speech samples. In general, more natural sounding speech is obtained when repairing
the more proficient TE speakers, while transformation of more extreme deviations results in a
synthetic quality which is sometimes not preferred over the original.
For duration repair, a different approach has been followed. Rather than only modifying
the deviant duration features, the whole utterance duration pattern is transformed to match an
average normal duration contour. Durations are predicted from a decision tree trained on nor-
mal speech. Different phone-based features have been explored, among which phone identity,
context and positional characteristics have been found to be good predictors of normal phone
durations. The main difficulty within this method lies in obtaining accurate phone recognition
and segmentations from TE speech in real-time. Various speech recognition techniques have
been investigated and the use of adaptation and language modelling has been shown to increase
performance. In order to make the duration repair algorithm more robust to recognition and
segmentation mistakes, a modification technique which takes the confidence of recognition into
account has been developed. The duration repair algorithm has been found to be successful
only in some cases. In general, duration repaired utterances are preferred when the original TE
duration patterns are quite deviant and the recogniser does not introduce significant durational
artifacts. However, when the original TE rhythm is not too deviant, it is preferred over the re-
paired version if artifacts are introduced in the recognition step. Improvements in TE speech
recognition are thus expected to increase the overall performance of the algorithm.
Whilst the proposed approach has been shown to be capable of repairing the deviant TE
features, the quality of the output speech is still quite far from natural. In fact, the naturalness,
intelligibility and rhythmic properties of the repaired sentences have been rated as poor overall.
Due to its synthetic quality, speech repaired with the developed technique would probably not
be preferred over the more natural sounding but deviant original for interaction with familiar
listeners who are used to the quality of the TE voices. However, for short interactions with
strangers, where listeners do not have the time to perceptually adapt to the different TE
qualities, the synthetic quality of the output would be less of a problem as long as the speech
remains intelligible, and the repair algorithm might facilitate communication.
6
Conclusions and Future Work
6.1 Conclusions
This dissertation began with an overview of the speech models and feature transformation tech-
niques employed in state-of-the-art VC systems and a discussion of the issues which limit their
application to more extreme transformation tasks such as accent and emotional conversion or
voice repair. The main problems lie in the modelling and conversion of the voice source and
prosody. In this thesis, a novel speech model and rened voice source and duration conversion
techniques have been proposed to address these issues. In addition, the developed speech model
and duration conversion method have been applied to repair TE speech. The following sections
summarise and discuss the results of this research and propose directions for future work.
6.1.1 Joint Estimation Analysis Synthesis
The Source-Filter and Sinusoidal Models are widely used to represent and manipulate the speech
signal. Since the Source-Filter representation is based on how speech is acoustically produced,
it allows a simple manipulation of the most relevant speaker identifying features, i.e. spectral
envelopes and average fundamental frequency. However, the use of Linear Prediction (LP) to
estimate the vocal tract filter parameters oversimplifies the modelling of the voice source and
results in poor speech quality. On the other hand, Sinusoidal Modelling is capable of producing
speech almost indistinguishable from the original at the expense of a more complex represen-
tation. For this reason, it is most widely used in VC applications requiring high-quality output.
Nevertheless, Sinusoidal Models still employ a source-filter decomposition based on LP for spec-
tral envelope and prosodic transformations, which re-introduces the problems related to over-
simplified voice source modelling. In addition, conversion of the sinusoidal phase components
is also a problem which causes artifacts.
The main drawback of the Source-Filter and Sinusoidal speech representations for VC is thus
the inaccurate modelling of the voice source assumed during source-filter deconvolution. The
developed Joint Estimation Analysis Synthesis (JEAS) modelling approach solves this issue by
using the LF model to describe the glottal source and employing a convex optimization method
to obtain automatic and simultaneous voice source and vocal tract filter parameterisations. The
JEAS model has been shown to be capable of producing speech almost indistinguishable from
the original and to achieve high-quality time and pitch-scale modications. In addition, the LF
parameterisation of the glottal source allows the use of linear transformations for voice source
conversion and because the model follows a time-domain source-lter representation, it does
not require phase conversion. These are advantageous features for VC. On the negative side,
JEAS modelling needs glottal closure instants (GCI) to be estimated and the processing time of
the joint estimation procedure is higher than that of LP. However, these difficulties can be easily
solved offline in VC applications.
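The structure of such a joint estimation can be illustrated with a much simplified sketch: for a fixed candidate glottal derivative pulse, fitting the vocal tract filter is a linear least-squares (and hence convex) problem, and the pulse shape parameters can then be chosen to minimise the residual energy. The Python sketch below is not the JEAS procedure itself; it substitutes a Rosenberg-type pulse for the LF model and a coarse grid search for the outer optimisation, purely to make the inner/outer structure concrete.

```python
import numpy as np

def rosenberg_derivative(T0, fs, oq=0.6, cq=0.3):
    """One period of the derivative of a Rosenberg-type glottal pulse.
    oq: open quotient; cq: fraction of the open phase used by the closing
    branch.  A crude stand-in for the LF model, used to keep the sketch short."""
    n = int(round(T0 * fs))
    t = np.arange(n) / fs
    tp = oq * T0 * (1.0 - cq)                 # end of the opening branch
    te = oq * T0                              # end of the open phase
    g = np.zeros(n)
    rise = t <= tp
    fall = (t > tp) & (t <= te)
    g[rise] = 0.5 * (1.0 - np.cos(np.pi * t[rise] / tp))
    g[fall] = np.cos(np.pi * (t[fall] - tp) / (2.0 * (te - tp)))
    return np.diff(g, prepend=0.0)            # differentiated glottal flow

def arx_fit(frame, source, order=18):
    """Inner (convex) step: least-squares fit of
    s[n] = sum_k a_k s[n-k] + b u[n] + e[n] for a fixed source u."""
    n = len(frame)
    cols = [np.concatenate((np.zeros(k), frame[:n - k])) for k in range(1, order + 1)]
    X = np.column_stack(cols + [source[:n]])
    theta, *_ = np.linalg.lstsq(X, frame, rcond=None)
    resid = frame - X @ theta
    return float(resid @ resid), theta

def joint_estimate(frame, T0, fs):
    """Outer step: search the pulse-shape parameters for the smallest
    ARX residual energy; returns (error, oq, cq, filter coefficients)."""
    best = None
    for oq in np.linspace(0.4, 0.8, 9):
        for cq in np.linspace(0.1, 0.5, 9):
            u = np.resize(rosenberg_derivative(T0, fs, oq, cq), len(frame))
            err, theta = arx_fit(frame, u)
            if best is None or err < best[0]:
                best = (err, oq, cq, theta)
    return best
```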
6.1.2 JEAS Voice Source and CART Duration Modelling for VC
The LF parameterisation of the glottal source obtained by JEAS Modelling offers voice source
conversion capabilities that Sinusoidal Models cannot provide. Sinusoidal VC systems have de-
veloped residual prediction methods based on the correlation between spectral envelope and LP
residuals to reintroduce the target spectral detail lost after envelope conversion. Because resid-
uals contain the errors introduced by the LP parameterisation, residual prediction techniques
have been found to improve conversion performance. However, LP residuals do not constitute
an accurate model of the voice source and residual prediction alone is not capable of modifying
the quality of the voice source. This prevents their use in applications requiring voice quality
modications such as, for example, TE speech repair. On the contrary, the LF model has been
shown to capture voice quality differences and thus, its conversion is expected to achieve voice
source quality transformations. In fact, JEAS LF voice source modelling and linear transforma-
tions have been shown to be capable of bringing the source speaker's glottal waveform charac-
teristics closer to the target's and to reduce the log spectral distortion of spectral envelope conversions
further.
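For illustration only, the simplest form of such a transformation, a single global affine mapping between source and target LF parameter vectors estimated by least squares from time-aligned training pairs, could be sketched as follows; the class-dependent transforms actually used are described in the earlier chapters, and the names and data here are hypothetical.

```python
import numpy as np

def fit_affine_transform(src, tgt):
    """Least-squares affine mapping y ~= A x + b between paired source and
    target parameter vectors (e.g. per-frame LF parameters).
    src, tgt: arrays of shape (n_frames, n_params)."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(X, tgt, rcond=None)         # shape (n_params + 1, n_params)
    return W[:-1].T, W[-1]                               # A, b

def convert(x, A, b):
    """Apply the learned mapping to one source parameter vector."""
    return A @ x + b

# toy usage with random stand-in data for paired source/target LF vectors
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 4))
tgt = src @ rng.normal(size=(4, 4)) + rng.normal(size=4)
A, b = fit_affine_transform(src, tgt)
converted_frame = convert(src[0], A, b)
```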
In terms of speaker recognizability, JEAS spectral envelope and glottal waveform conversion
have been found to be comparable to a state-of-the-art Sinusoidal VC implementation applying
standard envelope conversion and residual and phase prediction techniques. Furthermore, re-
garding the quality of the converted samples, JEAS VC has been found to be preferred over the
sinusoidal PSHM system, mainly because it lacks the noisy quality produced by the sinusoidal
amplitude and phase mismatches.
Among the prosodic features, this work has looked at ways of transforming duration char-
acteristics which have received little attention in existing VC implementations. The use of
CART trees traditionally employed to model and predict duration in TTS synthesisers has shown
promising results in a closed set VC experiment. The proposed duration conversion technique is
expected to perform similarly if sufficient target data is employed to train the decision trees. The
method is not limited to altering the speaking rate of the source speaker, but it can also modify
sentence rhythm. The addition of context and positional features to phone identities has been
found to reduce duration distortion ratios and to increase the correlation between the converted
and the target duration contours overall. It has also been shown to be applicable to repair the
duration characteristics of TE speech.
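A minimal sketch of this kind of tree-based duration prediction, using a generic decision-tree regressor rather than the specific CART configuration described earlier, might look as follows; the feature set and toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# One row per phone: identity, left/right context (categorical) and
# positional features (relative position in word and in phrase), with the
# observed duration in seconds as the prediction target.
cat_feats = [["ax", "sil", "b"], ["b", "ax", "aw"], ["aw", "b", "t"], ["t", "aw", "sil"]]
num_feats = np.array([[0.0, 0.0], [0.25, 0.1], [0.5, 0.2], [1.0, 0.3]])
durations = np.array([0.055, 0.075, 0.140, 0.090])

enc = OneHotEncoder(handle_unknown="ignore")
X = np.hstack([enc.fit_transform(cat_feats).toarray(), num_feats])
tree = DecisionTreeRegressor().fit(X, durations)

# Predict a "normal" duration for a new phone instance
x_new = np.hstack([enc.transform([["aw", "b", "t"]]).toarray(), [[0.5, 0.2]]])
predicted_duration = tree.predict(x_new)[0]
```

In practice the tree would of course be trained on a full corpus of normal speech rather than on four phones; the point is only how categorical and positional predictors are combined.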
6.1.3 Tracheoesophageal Speech Repair
The developed JEAS model and CART duration conversion method have been tested in an ex-
treme application: the repair of TE speech.
Regarding the voice source, glottal replacement, jitter and shimmer smoothing, raising the
fundamental frequency of speakers with very low pitch and reduction of the aspiration noise
component have been shown to produce less deviant, ugly and unpleasant repaired speech than
the original overall. In addition, the glottal replacement method based on JEAS modelling
has been found to achieve more natural sounding speech than a previously proposed fixed LF
replacement technique. However, the proposed voice source repair algorithms assume that TE
GCIs are given, which still cannot be reliably estimated automatically.
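To give a concrete, if simplified, picture of the perturbation smoothing mentioned above: given the GCIs, jitter and shimmer can be reduced by smoothing the cycle-to-cycle pitch-period and peak-amplitude contours, for example with a median filter as in the sketch below; this is one possible choice rather than the exact smoothing used in this work.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_period_and_amplitude(gci_samples, speech, fs, kernel=5):
    """From GCI positions (in samples) and the speech waveform, derive the
    pitch-period and per-cycle peak-amplitude contours and median-filter
    them, reducing cycle-to-cycle perturbations (jitter and shimmer).
    Resynthesis with the smoothed contours is a separate step."""
    gci = np.asarray(gci_samples)
    periods = np.diff(gci) / fs                              # raw pitch periods (s)
    amps = np.array([np.max(np.abs(speech[a:b]))             # per-cycle peak amplitude
                     for a, b in zip(gci[:-1], gci[1:])])
    return medfilt(periods, kernel), medfilt(amps, kernel)
```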
Regarding duration, a CART decision tree trained on normal data has been employed to re-
pair TE duration contours. Whilst the method was informally found to work reasonably well when
force-aligned segmentations were used, these would not be available in a real-time repair appli-
cation. Therefore, various TE speech recognisers and a robust duration modification technique
which takes recognition confidence into account have been investigated. The best duration
repair results have been obtained when phone identity, context and positional features were
used to train the decision trees, adaptation and language modelling were employed to train the
recogniser and robust modification was applied.
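The idea behind the robust modification can be sketched as follows: the time-scaling factor implied by the tree prediction for each phone is pulled back towards 1 (i.e. no change) when the recogniser's confidence in that phone is low, and the factors are clipped to avoid extreme stretching. The interpolation rule and limits below are illustrative assumptions, not the exact scheme used in this work.

```python
import numpy as np

def robust_scaling_factors(actual, predicted, confidence,
                           min_stretch=0.5, max_stretch=2.0):
    """Per-phone time-scaling factors for duration repair.
    actual:     recognised phone durations (s)
    predicted:  durations predicted by the decision tree (s)
    confidence: per-phone recognition confidence in [0, 1]
    Low-confidence phones are left almost unchanged; high-confidence phones
    are moved towards their predicted duration, within clipping limits."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    confidence = np.clip(np.asarray(confidence, dtype=float), 0.0, 1.0)
    raw = predicted / np.maximum(actual, 1e-3)      # unconstrained factors
    factors = 1.0 + confidence * (raw - 1.0)        # shrink towards 1
    return np.clip(factors, min_stretch, max_stretch)

# e.g. a poorly recognised phone (confidence 0.2) is barely modified
factors = robust_scaling_factors([0.05, 0.20], [0.10, 0.10], [0.2, 0.9])
```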
The most relevant conclusion obtained from the perceptual evaluations of the voice source
and duration repair algorithms is that their performance is highly dependent on the quality of
the original TE speech. For voice source repair, most natural sounding speech is achieved when
repairing the most proficient TE speakers while the transformation of the more extreme devi-
ations present in the least proficient speakers results in synthetic qualities which are often not
preferred over the originals. For duration repair, the method is successful only in some cases,
generally when the original duration pattern is quite deviant and the recogniser does not intro-
duce considerable durational artifacts. In this sense, improvement of the TE speech recognition
step is expected to increase the performance of the method. Nevertheless, it should be noted
that the repaired utterances are still perceived as poor in terms of naturalness, intelligibility
and rhythm overall, which suggests future work should be done to improve the quality of the
repaired speech.
6.2 Future Work
The speech model, voice source and duration transformation techniques proposed in this work
to make VC systems more robust to extreme applications have shown promising performance in
repairing TE speech. However, there is still considerable room for further improvement mostly
regarding TE speech repair. The areas which would greatly benet from future work are briey
discussed in this section.
Automatic TE GCI estimation JEAS modelling requires glottal closure instants to be estimated
for source-filter deconvolution and LF fitting. While algorithms such as DYPSA [71] exist
which are capable of obtaining reliable GCIs for the processing of normal speech, they do
not work as well with TE speech. For this reason, GCIs have been marked manually to
avoid the artifacts caused by automatic estimation methods in the experiments presented
in this dissertation. However, manual GCI extraction is a time consuming task which
prevents the use of the proposed voice source repair techniques in real-time applications.
Despite being noisier and less periodic than normal, TE GCIs can still be relatively easily
spotted manually, which suggests that the development of automatic TE GCI estimation
methods is a difcult but possible task.
TE speech recognition The main limitations of the proposed TE duration repair algorithm are
the durational artifacts produced by recognition and segmentation errors. These are due
to the poor performance of the TE speech recognition step. Building a speech recogniser
capable of more accurately recognising and segmenting TE phones would increase the per-
formance of the method and reduce the need for robust modication. Using more TE data
for adaptation or building TE speaker-dependent speech recognisers are possibilities worth
exploring. Sufficient training data could for example be acquired from the interaction be-
tween the TE user and the system in a real-time application.
TE spectral envelope conversion Voice source and duration are the most relevant limitations
TE speakers present after laryngectomy. However, when surgery is too extensive and re-
moval of organs surrounding the larynx is involved, the spectral and articulatory features
of the resulting speech are modied as well. In these cases, conversion of spectral en-
velopes might increase the naturalness of the repaired speech. Furthermore, if enough
normal speech data of the patient were available, spectral envelope conversion could be
used to attempt to obtain repaired speech recognisable as that of the TE speaker before
surgery.
Usability evaluation The proposed TE repair algorithms have been evaluated by comparing the
quality and rhythmic characteristics of the original and repaired samples and ranking the
naturalness, intelligibility and rhythm of the repaired speech. However, the usability of
the repair algorithms has not been tested in a real application yet. Whilst the quality of
the repaired speech has been described as poor overall, its use might still help improve
communication in, for example, a telephone conversation between a TE speaker and a
stranger. Simulating a real task during evaluation would give insight into the usefulness
of the repair approach in real applications.
Bibliography
[1] WSJCAM0 Corpus. http://svr-www.eng.cam.ac.uk/~ajr/wsjcam0/wsjcam0.html.
[2] Cancer Research UK. http://cancerresearchuk.org, 2007.
[3] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara. Voice Conversion through Vector
Quantization. In Proc. ICASSP, pages 565568, 1988.
[4] M. Airas and P. Alku. Comparison of Multiple Voice Source Parameters in Different Phona-
tion Types. In Proc. Interspeech, 2007.
[5] P. Alku. Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering.
Speech Communication, 11:109–118, 1992.
[6] P. Alku, T. Bäckström, and E. Vilkman. Normalized amplitude quotient for parametriza-
tion of the glottal flow. J. Acoust. Soc. Am., 112(2):701–710, 2002.
[7] P. Alku and E. Vilkman. Estimation of the glottal pulseform based on discrete all-pole
modelling. In Proc. ICSLP, pages 16191622, Yokohama, 1994.
[8] L.M. Arslan. Speaker transformation algorithm using segmental codebooks (STASC).
Speech Communication, 28(3):211226, 1999.
[9] L.M. Arslan and D. Talkin. Voice conversion by codebook mapping of line spectral fre-
quencies and excitation spectrum. In Proc. EUROSPEECH, pages 13471350, 1997.
[10] B.S. Atal and S.L. Hanauer. Speech Analysis and Synthesis by Linear Predictive Coding of
the Speech Wave. J. Acoust. Soc. Am., 50(2):637655, 1971.
[11] G. Bertino, A. Bellomo, C. Miani, F. Ferrero, and A. Staffieri. Spectrographic differences
between tracheal-esophageal and esophageal voice. Folia Phoniatrica et Logopaedica,
48:255–261, 1996.
[12] D.M. Brookes and D.S. Chan. Speaker Characteristics from a Glottal Airflow Model using
Glottal Inverse Filtering. Proc. Institute of Acoustics, 15:501–508, 1994.
[13] W. Campbell. Syllable-based Segmental Durations. In G. Bailly, C. Benoît, and T. Sawallis,
editors, Talking machines: Theories, models and designs, pages 43–60. Cambridge University
Press, 1992.
[14] D.T. Chappell and J.H.L. Hansen. Speaker-specific pitch contour modelling and modifica-
tion. In Proc. ICASSP, pages 885–888, 1998.
[15] J.H. Chen and A. Gersho. Real-time vector APC speech coding at 4800 bps with adaptive
postfiltering. In Proc. ICASSP, volume 29(5), pages 2185–2188, 1987.
[16] D.G. Childers. Glottal source modeling for voice conversion. Speech Communication,
16:127138, 1995.
[17] D.G. Childers and C. Ahn. Modeling the glottal volume-velocity waveform for three voice
types. J. Acoust. Soc. Am., 97(1):505519, 1995.
[18] D.G. Childers and C.K. Lee. Vocal quality factors: analysis, synthesis and perception. J.
Acoust. Soc. Am., 90(5):23942410, 1991.
[19] P. Cook. Identication of Control Parameters in an Articulatory Vocal Tract Model with
Applications to the Synthesis of Singing. PhD thesis, Stanford University, 1990.
[20] F. Debruyne, P. Delaere, J. Wouters, and P. Uwents. Acoustic analysis of tracheo-
oesophageal versus oesophageal speech. J. of Laryngology and Otology, 108:325328,
1994.
[21] A. del Pozo and S. Young. Continuous Tracheoesophageal Speech Repair. In Proc. EU-
SIPCO, 2006.
[22] A. del Pozo and S. Young. Repairing Tracheoesophageal Speech Duration. In Proc. Speech
Prosody, 2008.
[23] J.R. Deller, J.G. Proakis, and J.H.L. Hansen. Discrete-time processing of speech signals.
Macmillan, 1993.
[24] W. Ding, N. Campbell, N. Higuchi, and H. Kasuya. Fast and robust joint optimization of
vocal tract and voice source parameters. In Proc. ICASSP, volume 2, pages 12911294,
1997.
[25] B. Doval and C. d'Alessandro. Spectral correlates of glottal waveform models: an analytic
study. In Proc. ICASSP, pages 446–452, 1997.
[26] B. Doval, C. d'Alessandro, and N. Henrich. The spectrum of glottal flow models. Acta
Acustica united with Acustica, 92(6):1026–1046, 2006.
[27] A. El-Jaroudi and J. Makhoul. Discrete all-pole modelling. IEEE Trans. on Signal Process-
ing, 39:411–423, 1991.
[28] G. Fant. Acoustic Theory of Speech Production. The Netherlands: Mouton-The Hague,
1970.
[29] G. Fant. Glottal source and excitation analysis. STL-QPSR, 1:85107, 1979.
[30] G. Fant. The LF-model revisited. Transformations and frequency domain analysis. STL-
QPSR, pages 119156, 1995.
[31] G. Fant. The voice source in connected speech. Speech Communication, 22:125139,
1997.
[32] G. Fant, A. Kruckenberg, J. Liljencrants, and M. Båvegård. Voice source parameters in
continuous speech. Transformation of LF parameters. In Proc. ICSLP, pages 1451–1454,
Yokohama, 1994.
[33] G. Fant, J. Liljencrants, and Q. Lin. A four-parameter model of glottal flow. STL-QPSR,
pages 1–13, 1985.
[34] M. Fröhlich, D. Michaelis, and H.W. Strube. SIM-simultaneous inverse filtering and
matching of a glottal flow model for acoustic speech signals. J. Acoust. Soc. Am.,
110(1):479–488, 2001.
[35] Q. Fu and P. Murphy. Robust Glottal Source Estimation Based on Joint Source-Filter
Model Optimization. IEEE Trans on Audio, Speech and Language Processing, 14(2):492
501, 2006.
[36] H. Fujisaki and M. Ljungqvist. Proposal and evaluation of models for the glottal source
waveform. In Proc. ICASSP, pages 31.2.131.2.4, 1986.
[37] H. Fujisaki and M. Ljungqvist. Estimation of voice source and vocal tract parameters
based on ARMA analysis and a model for the glottal source waveform. In Proc. ICASSP,
pages 15.4.15.4.4, 1987.
[38] S. Furui. Research on individuality features in speech waves and automatic speaker recog-
nition techniques. Speech Communication, 5:183197, 1986.
[39] E. George and M. Smith. Speech Analysis/Synthesis and Modication using an Analysis-
by-Synthesis/Overlap-Add Sinusoidal Model. IEEE Trans. on Speech and Audio Processing,
5(5):389406, 1997.
[40] C. Gobl and A. Ní Chasaide. Techniques for analysing the voice source. In W.J. Hardcas-
tle and N. Hewlett, editors, Coarticulation: Theory, Data and Techniques, pages 300–320.
Cambridge University Press, 1999.
[41] C. Gobl and A. Ní Chasaide. Testing affective correlates of voice quality through analysis
and resynthesis. In ITRW on Speech and Emotion, pages 178–183, 2000.
[42] T. Haderlein, S. Steidl, E. Nöth, F. Rosanowski, and M. Schuster. Automatic Recognition
and Evaluation of Tracheoesophageal Speech. Text, Speech and Dialogue, Proceedings LNAI
3206, Springer, Berlin, Heidelberg, pages 331–338, 2004.
[43] N. Henrich, C. d'Alessandro, and B. Doval. Spectral correlates of voice open quotient
and glottal flow asymmetry: theory, limits and experimental data. In Proc. EUROSPEECH,
2001.
[44] J.N. Holmes. An Investigation of the Volume Velocity Waveform at the Larynx during
Speech by Means of an Inverse Filter. In Proc. of the 4th Int. Congr. Acoust., Copenhagen,
1962.
[45] Z. Inanoglu and S. Young. A System for Transforming the Emotion in Speech: Combining
Data-Driven Conversion Techniques for Prosody and Voice Quality. In Proc. Interspeech,
pages 490493, 2007.
[46] K. Itoh and S. Saito. Effects of acoustical feature parameters on perceptual speaker iden-
tity. Review of Electrical Communications Laboratories, 36(1):135141, 1988.
[47] C.C. Johnson, H. Hollien, and J.W. Hicks. Speaker identification utilizing selected tempo-
ral speech features. Journal of Phonetics, 12:319–326, 1984.
[48] A. Kain. High Resolution voice transformation. PhD thesis, Oregon Health and Science
University, 2001.
[49] A. Kain and M.W. Macon. Spectral voice conversion for text-to-speech synthesis. In Proc.
ICASSP, pages 285288, 1998.
[50] A. Kain and M.W. Macon. Design and evaluation of a voice conversion algorithm based
on spectral envelope mapping and residual prediction. In Proc. ICASSP, 2001.
[51] D.Y. Kim, H.Y. Chan, G. Evermann, M.J.F. Gales, D. Mrva, K.C. Sim, and P.C. Woodland.
Development of the CU-HTK 2004 Broadcast News Transcription System. In Proc. ICASSP,
pages 861864, 2005.
[52] D.H. Klatt and L.C. Klatt. Analysis, synthesis and perception of voice quality variations
among female and male talkers. J. Acoust. Soc. Am., 87(2):820857, 1990.
[53] A.K. Krishnamurthy. Glottal Source Estimation Using a Sum-of-Exponentials Model. IEEE
Trans on Signal Processing, 40(3):682686, 1992.
[54] H. Kuwabara and Y. Sagisaka. Acoustic characteristics of Speaker Individuality: Control
and Conversion. Speech Communication, 16(2):165173, 1995.
[55] J. Laroche, Y. Stylianou, and E. Moulines. HNS: Speech Modification Based on a Harmonic
plus Noise Model. In Proc. ICASSP, volume 2, pages 550–553, 1993.
[56] J. Laver. The phonetic description of voice quality. Cambridge University Press, 1980.
[57] K.S. Lee, D.H. Youn, and I.W. Cha. A new voice transformation method based on both
linear and nonlinear prediction analysis. In Proc. ICSLP, pages 14011404, 1996.
[58] Q. Lin. Speech Production theory and Articulatory Speech Synthesis. PhD thesis, Dept.
of Speech Communication and Music Acoustics, Royal Inst. of Technology (KTH), Stock-
holm, 1990.
[59] A. Loscos and J. Bonada. Emulating rough and growl voice in spectral domain. In Proc.
of the 7th International Conference on Digital Audio Effects, Naples, Italy, 2004.
[60] H.L. Lu. Toward a High-Quality Singing Synthesizer with Vocal Texture Control. PhD thesis,
Stanford University, 2002.
[61] H.L. Lu and J.O. Smith. Joint Estimation of Vocal Tract Filter and Glottal Source Waveform
via Convex Optimization. In Proc. IEEE workshop on applications of Signal Processing to
Audio and Acoustics, pages 7982, 1999.
[62] H.L. Lu and J.O. Smith. Glottal source modeling for singing voice synthesis. In Proc.
International Computer Music Conference, pages 7982, 2000.
[63] C.X. Ma, Y. Kamp, and L.F. Willems. A Frobenius Norm Approach to Glottal Closure
Detection from the Speech Signal. IEEE Trans. on Speech and Audio Processing, 2:258
265, 1994.
[64] The Mathworks. MATLAB Optimization Toolbox.
[65] The Mathworks. MATLAB Statistical Toolbox.
[66] R. McAulay and T. Quatieri. Speech Analysis/Synthesis Based on a Sinusoidal Represen-
tation. IEEE Trans. on Acoustics, Speech and Signal Processing, pages 744754, 1986.
[67] J.G. McKenna. Automatic glottal closed-phase location and analysis by Kalman filtering.
In Proc. 4th ISCA Tutorial and Research Workshop on Speech Synthesis, pages 91–96, 2001.
[68] R.L. Miller. Nature of the Vocal Cord Wave. J. Acoust. Soc. Am., 31:667677, 1959.
[69] E. Moore and M. Clements. Algorithm for automatic glottal waveform estimation without
the reliance on precise glottal closure information. In Proc. ICASSP, 2004.
[70] E. Moulines and F. Charpentier. Pitch-Synchronous waveform processing techniques for
text-to-speech synthesis using diphones. Speech Communication, 9:453467, 1990.
[71] P.A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes. Estimation of Glottal Closure
Instants in Voiced Speech using the DYPSA Algorithm. IEEE Trans on Speech and Audio
Processing, 15:3443, 2007.
[72] J. Perez and A. Bonafonte. Automatic Voice-Source Parameterization of Natural Speech.
In Proc. Interspeech, 2005.
[73] H.R. Pfitzinger. Intrinsic Phone Durations are Speaker-Specific. In Proc. ICSLP, volume 2,
pages 1113–1116, 2002.
[74] Y. Qi. Replacing tracheoesophageal voicing sources using LPC synthesis. J. Acoust. Soc.
Am., 88:12281235, 1990.
[75] Y. Qi and B. Weinberg. Characteristics of voicing source waveforms produced by
esophageal and tracheoesophageal speakers. Journal of Speech and Hearing Research,
38:536548, 1995.
[76] Y. Qi, B. Weinberg, and N. Bi. Enhancement of female esophageal and tracheoesophageal
speech. J. Acoust. Soc. Am., 98:24612465, 1995.
[77] T. Quatieri and R. McAulay. Speech Transformations Based on a Sinusoidal Representa-
tion. IEEE Trans. on Acoustics, Speech and Signal Processing, 34.
[78] T. Quatieri and R. McAulay. Shape Invariant Time-Scale and Pitch Modification of Speech.
IEEE Trans. on Signal Processing, 40:497–510, 1992.
[79] M.D. Riley. Tree-based modeling for speech synthesis. In G. Bailly, C. Benoît, and T. Sawallis,
editors, Talking machines: Theories, models and designs, pages 265–273. Cambridge University Press, 1992.
[80] J. Robbins, H.B. Fisher, E.C. Blom, and M.I. Singer. A comparative acoustic study of
normal, esophageal and tracheoesophageal speech production. Journal of Speech and
Hearing Disorders, 49:202210, 1984.
[81] X. Serra and J. Smith. Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based
on a Deterministic plus Stochastic Decomposition. Computer Music Journal, 14(4):12–24,
1990.
[82] M.I. Singer and E.D. Blom. An endoscopic technique for restoration of voice after laryn-
gectomy. Annals of Otology, Rhinology and Laryngology, 89:529533, 1980.
[83] L.F. Smith and K.H. Calhoun. Intelligibility of tracheoesophageal speech among naive
listeners. Southern Medical Journal, 87(3):333336, 1994.
[84] R. Smits and B. Yegnanarayana. Determination of Instants of Significant Excitation in
Speech Using Group Delay Function. IEEE Trans. on Speech and Audio Processing, 3:325–
333, 1995.
[85] H. Strik, B. Cranen, and L. Boves. Fitting a LF-model to inverse filter signals. In Proc.
EUROSPEECH, pages 103–106, 1993.
[86] H. Strik. Automatic parametrization of differentiated glottal flow: Comparing methods
by means of synthetic flow pulses. J. Acoust. Soc. Am., 103(5):2659–2669, 1998.
[87] H. Strik and L. Boves. Automatic Estimation of Voice Source Parameters. In Proc. ICSLP,
1994.
[88] Y. Stylianou, O. Cappé, and E. Moulines. Continuous Probabilistic Transforms for Voice
Conversion. IEEE Trans. on Acoustics, Speech and Signal Processing, 6(2):131–142, 1998.
[89] Y. Stylianou, J. Laroche, and E. Moulines. High-Quality Speech Modification Based on a
Harmonic plus Noise Model. In Proc. EUROSPEECH, pages 451–454, 1995.
[90] D. Suendermann, A. Bonafonte, H. Ney, and H. Hoege. A Study on Residual Prediction
Techniques for Voice Conversion. In Proc. ICASSP, volume I, pages 1316, 2005.
[91] D. Suendermann, H. Hoege, A. Bonafonte, H. Ney, and A. W. Black. Residual Prediction
Based on Unit Selection. In Proc. ASRU, pages 369374, 2005.
[92] T. Toda, A.W. Black, and K. Tokuda. Spectral Conversion Based on Maximum Likelihood
Estimation Considering Global Variance of Converted Parameter. In Proc. Interspeech,
volume I, pages 912, 2005.
[93] M.D. Trudeau and Y. Qi. Acoustic characteristics of female tracheoesophageal speech.
Journal of Speech and Hearing Disorders, 55:244250, 1990.
[94] O. Turk. New methods for voice conversion. MPhil thesis, Bogazici University, 2003.
[95] O. Turk. Cross-Lingual Voice Conversion. PhD thesis, Bogazici University, 2007.
[96] M. Unser, A. Aldroubi, and M. Eden. B-Spline Signal Processing. IEEE Trans. on Signal
Processing, 41(2).
[97] C.J. Van As. Tracheoesophageal speech: A multidimensional assessment of voice quality. PhD
thesis, University of Amsterdam, 2001.
[98] J. Van den Berg. Myoelastic-aerodynamic theory of voice production. Journal of Speech
and Hearing Research, 1:227–244, 1957.
[99] J.P.H. van Santen. Assignment of segmental duration in text-to-speech synthesis. Com-
puter Speech and Language, 8:95128, 1994.
[100] A. Verma and A. Kumar. Introducing roughness in individuality transformation through
jitter modeling and modication. In Proc. ICASSP, pages I5I8, 2005.
[101] D. Vincent, O. Rosec, and T. Chonavel. Estimation of LF glottal source parameters based
on an ARX model. In Proc. Interspeech, pages 333335, 2005.
[102] D. Vincent, O. Rosec, and T. Chonavel. A new method for speech synthesis and trans-
formation based on an ARX-LF Source-Filter decomposition and HNM modeling. In Proc.
ICASSP, volume 4, pages 525528, 2007.
[103] D.J. Wong, J.D. Markel, and A.H. Gray. Least squares glottal inverse filtering from an
acoustic speech wave. IEEE Trans. on Acoustics, Speech and Signal Processing, 27:350–
355, 1979.
[104] H. Ye. Voice Morphing using a Sinusoidal Model and Linear Transformation. PhD thesis,
Cambridge University Engineering Department, 2005.
[105] H. Ye and S. Young. Perceptually Weighted Linear Transformations for Voice Conversion.
In Proc. EUROSPEECH, pages 24092412, 2003.
[106] H. Ye and S. Young. High Quality Voice Morphing. In Proc. ICASSP, volume 1, pages
I912, 2004.
[107] H. Ye and S. Young. Quality-enhanced Voice Morphing using Maximum Likelihood Trans-
formations. IEEE Trans. on Audio, Speech and Language Processing, 14(4):1301–1312, 2006.
A
Recorded stimuli
1. bat
2. back
3. but
4. bait
5. bet
6. Bert
7. beet
8. bit
9. boat
10. bought
11. boot
12. book
13. van
14. this
15. zoo
16. azure
17. hat
18. fix
19. thick
20. sat
21. ship
22. We were away a year ago.
23. Early one morning a man and a woman ambled along a one mile end.
24. Should we chase those cowboys?
25. When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow.
26. The rainbow is a division of white light into many beautiful colours.
27. These take the shape of a long round arch with its path high above and its two ends
apparently beyond the horizon.
28. There is according to legend a boiling pot of gold at one end.
29. People look, but no one ever finds it.
30. When a man looks for something beyond his reach, his friends say he is looking for the pot
of gold at the end of the rainbow.
31. The kite did not fly over the fence, the ball flew over the fence.
32. The ball did not fly over the wall, the ball flew over the fence.
33. The new bike has not been borrowed, the new bike has been stolen.
34. The old bike has not been stolen, the new bike has been stolen.
35. The lock of the door is not broken, the bell of the door is broken.
36. The bell of the bike is not broken, the bell of the door is broken.
37. The red apples were not sour, the green apples were very sour.
38. The green apples were not bitter, the green apples were very sour.
39. Running is not healthier than cycling, walking is healthier than cycling.
40. Walking is not healthier than swimming, walking is healthier than cycling.
41. The pears in the tree are not ripe, the apples in the tree are ripe.
42. The apples in the basket are not ripe, the apples in the tree are ripe.
43. The pipe does not lie in the ashtray, the cigar lies in the ashtray.
44. The cigar does not lie on the plate, the cigar lies in the ashtray.
45. The cook did not go on holiday, the porter went on holiday.
46. The porter did not retire, the porter went on holiday.
47. The tomatoes are not in the barn, the potatoes are in the barn.
48. The potatoes are not in the cellar, the potatoes are in the barn.
49. That village does not have a bad name, that hotel has a bad name.
50. That hotel does not have a good name, that hotel has a bad name.