Neural Voice Cloning with a Few Samples
Sercan Ö. Arık*¹, Jitong Chen*¹, Kainan Peng*¹, Wei Ping*¹, Yanqi Zhou¹
Abstract

Voice cloning is a highly desired feature for personalized speech interfaces. Neural network based speech synthesis has been shown to generate high-quality speech for a large number of speakers.

1. Introduction

Generative models can be conditioned on text (Wang et al., 2017) and speaker identity (Arik et al., 2017b; Ping et al., 2017). While text carries linguistic information and controls the content of the generated speech, speaker representation captures speaker characteristics such as pitch, speech rate and accent.
Figure 1. Speaker adaptation and speaker encoding approaches for training, cloning and audio generation.
information, we use a multi-head self-attention mechanism (Vaswani et al., 2017) to compute the weights for different audios and get aggregated embeddings.

2.3. Discriminative models for evaluation

Voice cloning performance metrics can be based on human evaluations through crowdsourcing platforms, but these are slow and expensive during model development. In this section, we propose two evaluation methods using discriminative models.

2.3.1. SPEAKER CLASSIFICATION

A speaker classifier determines which speaker an audio sample belongs to. For voice cloning evaluation, a speaker classifier can be trained on the set of target speakers used for cloning. High-quality voice cloning would result in high speaker classification accuracy. We use a speaker classifier with spectral and temporal processing layers similar to those shown in Fig. 7, and an additional embedding layer before the softmax function.

2.3.2. SPEAKER VERIFICATION

Speaker verification is the task of authenticating the claimed identity of a speaker, based on a test audio and enrolled audios from the speaker. In particular, it performs binary classification to identify whether the test audio and enrolled audios are from the same speaker (e.g., Snyder et al., 2016). In this work, we consider the end-to-end text-independent speaker verification framework of (Snyder et al., 2016) (see Appendix C for more details of the model architecture). One can train a speaker verification model on a multi-speaker dataset, then directly test whether the cloned audio and the ground truth audio are from the same speaker. Unlike the speaker classification approach, the speaker verification model does not require training with audios from the target speaker for cloning, hence it can be used for unseen speakers with a few samples. As the quantitative performance metric, the equal error rate (EER)4 from the speaker verification model can be used to measure how close the cloned audios are to the ground truth audios.

4 One may change the decision threshold to trade off between the false acceptance rate and the false rejection rate. When the two rates are equal, it is referred to as the equal error rate.

3. Experiments

We compare two approaches for voice cloning. For the speaker adaptation approach, we train a multi-speaker generative model and adapt it to a target speaker by fine-tuning the embedding or the whole model. For the speaker encoding approach, we train a multi-speaker generative model and a speaker encoder. The estimated speaker embedding is then fed to the multi-speaker generative model to generate audios for the target speaker.

3.1. Datasets

The multi-speaker generative model and the speaker encoder are trained using the LibriSpeech dataset (Panayotov et al., 2015), which contains audios for 2484 speakers sampled at 16 KHz, totalling 820 hours. LibriSpeech is a dataset for automatic speech recognition, and its audio quality is lower compared to speech synthesis datasets.5

5 We designed a segmentation and denoising pipeline to process LibriSpeech, as described in (Ping et al., 2017).

Voice cloning is performed using the VCTK dataset (Veaux et al., 2017). VCTK consists of audios for 108 native speakers of English with various accents. To be consistent with the LibriSpeech dataset, VCTK audio samples are downsampled to 16 KHz. For a chosen speaker, a few cloning audios are sampled randomly for each experiment. The sentences presented in Appendix B are used to generate audios for evaluation.

3.2. Specifications

3.2.1. MULTI-SPEAKER GENERATIVE MODEL

Our multi-speaker generative model is based on the convolutional sequence-to-sequence architecture proposed in (Ping et al., 2017), with similar hyperparameters and a Griffin-Lim vocoder. To get better performance, we increase the time-resolution by reducing the hop length and window size parameters to 300 and 1200, and add a quadratic loss term to penalize larger amplitude components superlinearly. For speaker adaptation experiments, we reduce the embedding dimensionality to 128, as it yields fewer overfitting problems. Overall, the baseline multi-speaker generative model has around 25M trainable parameters.

3.2.2. SPEAKER ADAPTATION

For the speaker adaptation approach, either the entire multi-speaker generative model parameters or only its speaker embeddings are fine-tuned. For both cases, optimization is applied separately to each of the 108 speakers from the VCTK dataset.

3.2.3. SPEAKER ENCODING

For the speaker encoding approach, we train speaker encoders for different numbers of cloning audios separately, to obtain the minimum validation loss. We convert cloning audios to log-mel spectrograms with 80 frequency bands, a hop length of 400, and a window size of 1600. The log-mel spectrograms are fed to spectral processing layers, which are composed of a 2-layer prenet of size 128. Then, temporal processing is applied with gated convolutional layers and residual connections, followed by the multi-head attention that aggregates information across cloning samples (see Fig. 7).
Table 1. Comparison of requirements for speaker adaptation and speaker encoding. Cloning time interval assumes 1-10 cloning audios.
Inference time is for an average sentence. All assume implementation on a TitanX GPU.
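The cloning-time gap summarized in Table 1 follows from the mechanics of the two approaches: speaker adaptation requires per-speaker gradient updates, while speaker encoding needs only a forward pass of the encoder. A minimal PyTorch-style sketch of embedding-only adaptation is shown below; the model interface (generation_loss) and the optimizer choice are hypothetical, not from the paper.

```python
import torch

def adapt_embedding(model, cloning_batches, steps=1000, lr=1e-4):
    """Embedding-only speaker adaptation: freeze all generative model
    parameters and fine-tune a single 128-dim speaker embedding
    (Section 3.2 uses embedding dimensionality 128). The method
    model.generation_loss is a hypothetical interface."""
    for p in model.parameters():
        p.requires_grad = False           # freeze the generative model
    embedding = torch.randn(128, requires_grad=True)
    optimizer = torch.optim.SGD([embedding], lr=lr)
    for _, (text, mel) in zip(range(steps), cloning_batches):
        loss = model.generation_loss(text, mel, speaker_embedding=embedding)
        optimizer.zero_grad()
        loss.backward()                   # gradients flow only to embedding
        optimizer.step()
    return embedding.detach()
```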
Figure 5. Speaker verification EER (using 5 enrollment audios) for different numbers of cloning samples.

Figs. 4 and 5 show the classification accuracy and equal error rate (EER) for both the speaker adaptation and speaker encoding approaches, obtained by the speaker classification and speaker verification models. Both approaches benefit from more cloning audios. When the number of cloning audio samples exceeds 5, whole model adaptation outperforms the other techniques in both metrics. Speaker encoding approaches yield a lower classification accuracy compared to embedding adaptation, but they achieve similar speaker verification performance.

3.3.2. HUMAN EVALUATIONS

Table 2. Mean Opinion Score (MOS) evaluations for naturalness with 95% confidence intervals.

Whole model adaptation benefits the most from additional cloning audios, as expected, since there are more degrees of freedom provided to be tuned for that particular speaker. There is a very small difference in naturalness for speaker encoding approaches when the number of cloning audios is increased. Most importantly, speaker encoding does not degrade the naturalness of the baseline multi-speaker generative model. Fine-tuning improves the naturalness of speaker encoding, as expected, since it allows the generative model to learn how to compensate for the errors of the speaker encoder while training. Similarity scores slightly improve with higher sample counts for speaker encoding, and match the scores for speaker embedding adaptation. The gap in similarity to the ground truth is mostly attributed to the limited naturalness of the outputs, as the models are trained with the LibriSpeech dataset.

3.4. Speaker embedding space and manipulation

As shown in Fig. 6 and Appendix E, speaker encoder models map speakers into a meaningful latent space. Inspired by word embedding manipulation (e.g., via simple algebraic operations such as king − queen = male − female), we consider applying algebraic operations to the inferred embeddings to transform their speech characteristics.

To transform gender, we get the averaged speaker embeddings for female and male speakers, and add their difference to a particular speaker. For example, adding the averaged female embedding and subtracting the averaged male embedding shifts a male speaker's embedding toward female speech characteristics.
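The gender transformation described above amounts to simple vector arithmetic on the inferred embeddings. A minimal numpy sketch, with hypothetical arrays of per-speaker embeddings and gender labels:

```python
import numpy as np

def transform_gender(embeddings, is_female, speaker_idx):
    """Shift one speaker's embedding along the female-male direction, as in
    Section 3.4. embeddings ([n_speakers, dim]) and is_female (boolean
    labels) are hypothetical inputs."""
    embeddings = np.asarray(embeddings)
    is_female = np.asarray(is_female)
    female_mean = embeddings[is_female].mean(axis=0)
    male_mean = embeddings[~is_female].mean(axis=0)
    # Adding the averaged difference transforms gender characteristics
    # while keeping other traits of the chosen speaker.
    return embeddings[speaker_idx] + (female_mean - male_mean)
```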
Figure 6. Visualization of the estimated speaker embeddings by the speaker encoder. The first two principal components of the average speaker embeddings, for the speaker encoder with a 5 sample count. Only British and North American regional accents are shown as they constitute the majority of the labeled speakers in the VCTK dataset. Please see Appendix E for more detailed analysis.

4. Related Work

4.1. Few-shot generative models

Humans can learn most new generative tasks from only a few examples, and this has motivated research on few-shot generative models.

Early studies on few-shot generative modeling mostly focus on Bayesian models. In (Lake et al., 2013) and (Lake et al., 2015), hierarchical Bayesian models are used to exploit compositionality and causality for few-shot generation of characters. In (Lake et al., 2014), a similar idea is adapted to the acoustic modeling task, with the goal of generating new words in a different language.

Recently, deep learning approaches have been adapted to few-shot generative modeling, particularly for image generation applications. In (Reed et al., 2017), few-shot distribution estimation is considered using an attention mechanism and a meta-learning procedure, for conditional image generation. In (Azadi et al., 2017), few-shot learning is applied to font style transfer, by modeling the glyph style from a few observed letters, and synthesizing the whole alphabet conditioned on the estimated style. The technique is based on multi-content generative adversarial networks, penalizing unrealistic synthesized letters compared to the ground truth. In (Rezende et al., 2016), sequential generative modeling is applied for one-shot generalization in image generation, using a spatial attention mechanism.
4.2. Speaker embeddings in speech processing

Speaker embedding is a well-established approach to encode discriminative speaker information. It has been used in many speech processing tasks such as speaker recognition/verification (Li et al., 2017), speaker diarization (Rouvier et al., 2015), automatic speech recognition (Doddipatla, 2016) and speech synthesis (Arik et al., 2017b). In some of these, the model explicitly learns to output embeddings with a discriminative task such as speaker classification. In others, embeddings are randomly initialized and implicitly learned from an objective function that is not directly related to speaker discrimination. For example, in (Arik et al., 2017b), a multi-speaker generative model is trained to generate audio from text, where speaker embeddings are implicitly learned from a generative loss function. This approach encourages most parts of the model to be speaker-independent and shared, while pushing speaker-related information into the embeddings.

4.3. Voice conversion

The goal of voice conversion is to modify an utterance from a source speaker to make it sound like the target speaker, while keeping the linguistic content unchanged.

One common approach is dynamic frequency warping, to align the spectra of different speakers. (Agiomyrgiannakis & Roupakia, 2016) proposes a dynamic programming algorithm that simultaneously estimates the optimal frequency warping and weighting transform while matching source and target speakers using a matching-minimization algorithm. Wu et al. (2016) use a spectral conversion approach integrated with locally linear embeddings for manifold learning. There are also approaches that model spectral conversion using neural networks, such as (Desai et al., 2010), (Chen et al., 2014) and (Hwang et al., 2015); these are typically trained with a large amount of paired audio from the target and source speakers.

5. Conclusions

In this paper, we have demonstrated two approaches for neural voice cloning: speaker adaptation and speaker encoding. We demonstrate that both approaches can achieve reasonable cloning quality even with only a few cloning audios.

For naturalness, we demonstrate that both speaker adaptation and speaker encoding can achieve an MOS for naturalness similar to the baseline multi-speaker generative model. Thus, the proposed techniques can potentially be used with better multi-speaker models in the future (such as replacing Griffin-Lim with a WaveNet vocoder as in (Shen et al., 2017a)).

For similarity, we demonstrate that both approaches benefit from a larger number of cloning audios. The performance gap between whole-model and embedding-only adaptation indicates that some discriminative speaker information still exists in the generative model besides the speaker embeddings. The benefit of a compact representation via embeddings is fast cloning and a small footprint size per user. Especially for applications with resource constraints, these practical considerations clearly favor the use of the speaker encoding approach. Methods to fully embed the speaker information into the embeddings would be an important research direction to improve the performance of voice cloning.

Training of the multi-speaker generative model in this paper is done using a speech recognition dataset with low-quality audios and limited diversity in the representation of the universal set of speakers. Improvements in the quality of the dataset would result in higher naturalness and similarity of the generated samples. Also, increasing the amount and diversity of speakers should enable a more meaningful speaker embedding space, which can improve the similarity obtained by both approaches. We expect our techniques to benefit significantly from a large-scale and high-quality multi-speaker speech dataset.

We believe that there are many promising horizons for improvement in voice cloning. Advances in meta-learning, i.e., the systematic approach of learning-to-learn while training, should be promising for voice cloning, e.g., by integrating speaker adaptation or encoding into training, or by inferring the model weights in a more flexible way than through speaker embeddings.
Rezende, Danilo, Mohamed, Shakir, Danihelka, Ivo, Gregor, Karol, and Wierstra, Daan. One-shot generalization in deep generative models. In Balcan, Maria Florina and Weinberger, Kilian Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1521–1529, New York, New York, USA, 20–22 Jun 2016. PMLR.

Rouvier, M., Bousquet, P. M., and Favre, B. Speaker diarization through speaker embeddings. In 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 2082–2086, Aug 2015. doi: 10.1109/EUSIPCO.2015.7362751.

Sagisaka, Yoshinori, Kaiki, Nobuyoshi, Iwahashi, Naoto, and Mimura, Katsuhiko. ATR µ-talk speech synthesis system. In Second International Conference on Spoken Language Processing, 1992.

Shen, Jonathan, Pang, Ruoming, Weiss, Ron J., Schuster, Mike, Jaitly, Navdeep, Yang, Zongheng, Chen, Zhifeng, Zhang, Yu, Wang, Yuxuan, Skerry-Ryan, RJ, Saurous, Rif A., Agiomyrgiannakis, Yannis, and Wu, Yonghui. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. December 2017a.

Shen, Jonathan, Pang, Ruoming, Weiss, Ron J., Schuster, Mike, Jaitly, Navdeep, Yang, Zongheng, Chen, Zhifeng, Zhang, Yu, Wang, Yuxuan, Skerry-Ryan, RJ, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884, 2017b.

Snyder, David, Ghahremani, Pegah, Povey, Daniel, Garcia-Romero, Daniel, Carmiel, Yishay, and Khudanpur, Sanjeev. Deep neural network-based speaker embeddings for end-to-end speaker verification. In IEEE Spoken Language Technology Workshop (SLT), pp. 165–170, 2016.

Taigman, Yaniv, Wolf, Lior, Polyak, Adam, and Nachmani, Eliya. Voice synthesis for in-the-wild speakers via a phonological loop. CoRR, abs/1707.06588, 2017.

van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6000–6010. Curran Associates, Inc., 2017.

Veaux, Christophe, Yamagishi, Junichi, and MacDonald, Kirsten, et al. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, 2017.

Wang, Yuxuan, Skerry-Ryan, R. J., Stanton, Daisy, Wu, Yonghui, Weiss, Ron J., Jaitly, Navdeep, Yang, Zongheng, Xiao, Ying, Chen, Zhifeng, Bengio, Samy, Le, Quoc V., Agiomyrgiannakis, Yannis, Clark, Rob, and Saurous, Rif A. Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR, abs/1703.10135, 2017.

Wester, Mirjam, Wu, Zhizheng, and Yamagishi, Junichi. Analysis of the voice conversion challenge 2016 evaluation results. In Interspeech, pp. 1637–1641, September 2016.

Wu, Yi-Chiao, Hwang, Hsin-Te, Hsu, Chin-Cheng, Tsao, Yu, and Wang, Hsin-min. Locally linear embedding for exemplar-based spectral conversion. In Interspeech, pp. 1652–1656, September 2016.

Zen, Heiga, Tokuda, Keiichi, and Black, Alan W. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.
Appendices
A. Detailed speaker encoder architecture
Mel spectrograms → [FC → ELU] × N_prenet (spectral processing) → [Convolution → Gated linear unit, residual scaled by √0.5] × N_conv (temporal processing) → Temporal masking → FC (keys, queries, values) → Multi-head attention → FC → Softsign → Cloning samples attention → Normalize → Speaker embedding.
Figure 7. Speaker encoder architecture with intermediate state dimensions (batch: batch size, N_samples: number of cloning audio samples |A_s^k|, T: number of mel spectrogram timeframes, F_mel: number of mel frequency channels, F_mapped: number of frequency channels after the prenet, d_embedding: speaker embedding dimension). The multiplication operation at the last layer represents an inner product along the dimension of cloning samples.
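For concreteness, the flow in Figure 7 can be sketched as follows. This is a simplified PyTorch rendition under several assumptions: the layer sizes, the time pooling, and the final aggregation are guesses; in particular, the paper aggregates with a learned cloning-samples attention (an inner product across samples), whereas this sketch takes a plain mean after multi-head attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoderSketch(nn.Module):
    """Condensed sketch of the Figure 7 flow: prenet -> gated conv blocks
    -> attention over cloning samples -> normalized embedding."""
    def __init__(self, f_mel=80, f_mapped=128, d_embed=512,
                 n_conv=2, n_heads=2):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(f_mel, f_mapped), nn.ELU())
        self.convs = nn.ModuleList([
            nn.Conv1d(f_mapped, 2 * f_mapped, kernel_size=3, padding=1)
            for _ in range(n_conv)])
        self.attn = nn.MultiheadAttention(f_mapped, n_heads,
                                          batch_first=True)
        self.fc = nn.Linear(f_mapped, d_embed)

    def forward(self, mels):
        # mels: [batch, N_samples, T, F_mel] log-mel spectrograms
        b, n, t, _ = mels.shape
        x = self.prenet(mels)                          # spectral processing
        x = x.reshape(b * n, t, -1).transpose(1, 2)    # [b*n, F_mapped, T]
        for conv in self.convs:                        # temporal processing
            x = (F.glu(conv(x), dim=1) + x) * 0.5 ** 0.5  # gated, residual
        x = x.mean(dim=2).reshape(b, n, -1)            # pool over time
        x, _ = self.attn(x, x, x)                      # attend across samples
        e = F.softsign(self.fc(x)).mean(dim=1)         # aggregate samples
        return F.normalize(e, dim=-1)                  # unit-norm embedding
```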
B. Test sentences

Figure 8. The sentences used to generate test samples for the voice cloning models. The white space characters / and % follow the same definition as in (Ping et al., 2017).
C. Speaker verification model

The similarity between enrollment and test audios is scored as s(x, y) = w · xᵀS y + b, where x and y are the speaker encodings of the enrollment and test audios respectively after the fully-connected layer, w and b are scalar parameters, and S is a symmetric matrix. Then, s(x, y) is fed into a sigmoid unit to obtain the probability that they are from the same speaker. The model is trained using a cross-entropy loss. Table 4 lists the hyperparameters of the speaker verification model for the LibriSpeech dataset.
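A sketch of this scoring head, under the bilinear form reconstructed above; the symmetric parameterization S = A + Aᵀ is our own choice, not a detail from the paper:

```python
import torch
import torch.nn as nn

class VerificationScore(nn.Module):
    """Scoring head for s(x, y) = w * x^T S y + b, with S kept symmetric
    by parameterizing S = A + A^T."""
    def __init__(self, dim):
        super().__init__()
        self.A = nn.Parameter(torch.eye(dim))
        self.w = nn.Parameter(torch.ones(1))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x, y):
        # x, y: [batch, dim] speaker encodings of enrollment and test audios
        S = self.A + self.A.t()                        # enforce symmetry
        s = self.w * (x @ S * y).sum(dim=1) + self.b   # batched x^T S y
        return torch.sigmoid(s)  # probability of a same-speaker pair
```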
8 Enrollment audios are from the same speaker.
Figure 9. Speaker verification model architecture: convolution blocks (2D convolution, batch norm, ReLU) × N, followed by a recurrent layer, pooling, and a fully-connected layer in the SV module, producing a PLDA score.
Table 4. Hyperparameters of the speaker verification model for the LibriSpeech dataset.

Parameter                                       Value
Audio resampling freq.                          16 KHz
Bands of mel-spectrogram                        80
Hop length                                      400
Convolution layers, channels, filter, strides   1, 64, 20 × 5, 8 × 2
Recurrent layer size                            128
Fully connected size                            128
Dropout probability                             0.9
Learning rate                                   10^-3
Max gradient norm                               100
Gradient clipping max. value                    5
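Combining Figure 9 with Table 4 gives roughly the following PyTorch sketch; anything beyond the listed hyperparameters is an assumption (the GRU choice, and reading the dropout probability 0.9 as a keep probability):

```python
import torch
import torch.nn as nn

class SpeakerVerifierSketch(nn.Module):
    """Sketch from Figure 9 and Table 4: one 2D convolution block
    (64 channels, 20x5 filter, 8x2 stride) with batch norm and ReLU,
    a recurrent layer of size 128, mean pooling over time, and a
    fully-connected layer of size 128."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(20, 5), stride=(8, 2)),
            nn.BatchNorm2d(64),
            nn.ReLU())
        freq_out = (n_mels - 20) // 8 + 1            # frequency bins left
        self.rnn = nn.GRU(input_size=64 * freq_out, hidden_size=128,
                          batch_first=True)
        self.dropout = nn.Dropout(p=0.1)  # Table 4's 0.9 read as keep-prob
        self.fc = nn.Linear(128, 128)

    def forward(self, mel):
        # mel: [batch, n_mels, T] mel spectrogram
        x = self.conv(mel.unsqueeze(1))              # [batch, 64, F', T']
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.rnn(x)
        pooled = out.mean(dim=1)                     # pool over time
        return self.fc(self.dropout(pooled))         # speaker encoding
```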
Figure 10. Speaker verification EER (using 1 enrollment audio) vs. number of cloning audio samples.
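The EER reported in Figs. 5 and 10 can be computed from verification scores as defined in footnote 4. A minimal sketch, assuming numpy/scikit-learn and hypothetical arrays of genuine and impostor scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: the operating point where the false acceptance and false
    rejection rates are equal (cf. footnote 4)."""
    labels = np.concatenate([np.ones(len(genuine_scores)),
                             np.zeros(len(impostor_scores))])
    scores = np.concatenate([genuine_scores, impostor_scores])
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.argmin(np.abs(fpr - fnr)))  # closest to FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2.0
```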
D. Implications of attention
For a trained speaker encoder model, Fig. 12 exemplifies the attention distributions for different audio lengths. The attention mechanism can yield highly non-uniformly distributed coefficients when combining the information in different cloning samples, and in particular assigns higher coefficients to longer audios, as intuitively expected, since longer audios potentially carry more information content.
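The comparison in Figs. 11 and 12 comes down to how per-sample embeddings are combined: learned attention coefficients versus the uniform 1/N weighting of plain averaging. A minimal sketch, with hypothetical inputs:

```python
import numpy as np

def aggregate(per_sample_embeddings, attention_weights=None):
    """Combine hypothetical [n_samples, dim] per-sample encoder outputs.
    attention_weights are inferred coefficients as in Fig. 12 (summing to
    one across cloning samples)."""
    per_sample_embeddings = np.asarray(per_sample_embeddings)
    n = len(per_sample_embeddings)
    if attention_weights is None:
        # The "without attention" baseline of Fig. 11 and the dashed line
        # of Fig. 12: uniform weights, i.e., simple averaging.
        attention_weights = np.full(n, 1.0 / n)
    return attention_weights @ per_sample_embeddings
```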
Figure 11. Mean absolute error in embedding estimation vs. the number of cloning audios, for a validation set of 25 speakers, shown with the attention mechanism and without the attention mechanism (simple averaging).
Figure 12. Inferred attention coefficients for the speaker encoder model with N_samples = 5 vs. the lengths of the cloning audio samples. The dashed line corresponds to the case of averaging all cloning audio samples.
E. Speaker embedding space

Figure 13. First two principal components of the inferred embeddings, with the ground truth labels for gender and region of accent for the VCTK speakers, as in (Arik et al., 2017b).
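The projections in Figs. 6 and 13 are standard PCA. A minimal sketch, assuming scikit-learn and a hypothetical embedding matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

def first_two_components(embeddings):
    """Project a hypothetical [n_speakers, dim] embedding matrix onto its
    first two principal components, as visualized in Figs. 6 and 13."""
    return PCA(n_components=2).fit_transform(np.asarray(embeddings))
```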
F. Similarity scores

For the results in Table 3, Fig. 14 shows the distribution of the scores given by MTurk users, following (Wester et al., 2016). For a 10 sample count, the ratio of evaluations with the 'same speaker' rating exceeds 70% for all models.
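The 'same speaker' ratio can be computed directly from the 4-point ratings; a minimal sketch, with our own (hypothetical) encoding of the (Wester et al., 2016) scale:

```python
import numpy as np

def same_speaker_ratio(ratings):
    """Fraction of "same speaker" responses. ratings is a hypothetical
    integer array on the 4-point scale of (Wester et al., 2016), encoded
    here (our choice) as 1 = "different, absolutely sure" up to
    4 = "same, absolutely sure"; 3 and 4 count as "same speaker"."""
    return float(np.mean(np.asarray(ratings) >= 3))
```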