
COMBINING SYLLABLE AND DIPHONE UNITS FOR A CORPUS BASED TAMIL TEXT TO SPEECH SYNTHESIS SYSTEM

Kiruthika S1 and Krishnamoorthy K2

1 Department of CSE, King College of Technology, Namakkal, Tamil Nadu, India
2 Department of CSE, Sudharsan Engineering College, Pudukkottai, Tamil Nadu, India

Abstract: This paper describes the construction and development of a Tamil Text-To-Speech (TTS) synthesis system that combines syllable and diphone units. Initially, a phone-based TTS was built, followed by a TTS based on monosyllabic word cluster units. It was determined that the quality of the synthesized sentences improves when polysyllabic word units are used (when suitable units are available), since the effects of co-articulation are preserved in such a case. Hence, we built a Tamil TTS with syllabic units, which contains cluster units of more than one type (monosyllable, bi-syllable and tri-syllable). However, polysyllabic units alone failed to improve the TTS in areas such as sentence termination, scientific notations, website links and email addresses; these lagging areas can be processed effectively by diphone units. This research aims at the most effective concatenation and prosody by combining syllable and diphone units. The first (n-1) words of a sentence are processed by polysyllabic units for better quality, whereas the nth (final) word of the sentence, scientific notations, website links and email addresses are processed by diphones. Implementing both syllables and diphones in Tamil speech synthesis requires two different corpus tables. Preliminary listening tests indicated that the combined syllable and diphone TTS has higher quality.
Keywords: Speech synthesis system, Text to speech (TTS), Prosody, Diphone, Syllable.

1. Introduction:
There are numerous techniques for improving a speech synthesis system. One of the known approaches for speech recognition is the Hidden Markov Model (HMM). The phoneme types, syllable patterns, and inflectional characteristics of a language decide the kind of technique to be used for synthesis. The unique characteristics of a language are analyzed from the order of prevalence of the phonemes, syllable patterns and words that comprise the language. A statistical approach is needed for selecting which class of units is to be used for speech synthesis. The statistical language model helps to detect the existence of phones, syllable patterns and words in the Tamil language.

2. Issues found in diphones:
A diphone is defined as two connected half phones and describes the transition between two phones by starting in the middle of the first phone and ending in the middle of the second phone. It describes the coarticulation effects and minimizes the discontinuities at the concatenation points. Diphones are comparatively bigger units than phones. There are about one thousand to two thousand diphones in the Tamil language. Unlike phones, they do not have allophonic variations, i.e., each diphone has only one instance of pronunciation. Diphone concatenation can produce speech of reasonable quality. However, a single example of each diphone is not enough to produce good quality speech. Moreover, diphone-based synthesizers need elaborate prosody rules to produce natural speech. Diphones cannot capture co-articulation better than recent methodologies. As noted in existing work [4], the concatenation points are comparatively more numerous, so a large database is needed to store the corpus data.

3. Issues found in Syllables:
Tamil is a syllable-oriented language, in which pronunciation is mainly based on syllables. The syllable can therefore be the best unit for Tamil speech synthesis systems. Intelligible speech synthesis is possible for Tamil with the syllable as the basic unit. Though the number of syllables is larger in comparison to phones or diphones, syllables can describe co-articulation better than phones. The number of concatenation points decreases when syllables are used. Syllable boundaries are characterized by regions of low energy. The general format of a Tamil syllable is C*VC*, where C is a consonant, V is a vowel and C* indicates the presence of 0 or more consonants. Syllables alone may not yield natural speech synthesis, since the syllable concentrates only on vowels and consonants.
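The C*VC* format can be made concrete with a small check. The sketch below is only an illustration and not the parser used in this work: the romanized vowel symbols and the rule that every other symbol counts as a consonant are assumptions introduced here.

```python
import re
from typing import List

# Assumption: a romanized inventory of the twelve Tamil vowels; every other
# phone symbol is treated as a consonant.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "ai", "o", "oo", "au"}

# The C*VC* format from the text: any number of consonants, exactly one
# vowel, any number of consonants.
SYLLABLE_FORMAT = re.compile(r"C*VC*")


def cv_pattern(phones: List[str]) -> str:
    """Map a phone sequence to a string of 'C' and 'V' labels."""
    return "".join("V" if phone in VOWELS else "C" for phone in phones)


def is_syllable(phones: List[str]) -> bool:
    """Return True if the whole phone sequence fits the C*VC* format."""
    return SYLLABLE_FORMAT.fullmatch(cv_pattern(phones)) is not None
```

Under these assumptions the sequence k, a, l fits the format (CVC), whereas k, a, l, a does not, because it contains two vowels and therefore spans two syllables.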

Individual letter parsing at the phoneme level needs to be applied at the end of every sentence so that the sentence termination is pronounced naturally. Scientific notations, website links, email addresses and stress notes are to be processed by diphone units and then concatenated with the already processed syllable units.

Fig: Tamil speech synthesis system

4. Prosody Tagging:
Prosodic phrasing is an important and particularly difficult problem for Tamil, as Tamil scripts use very little or no punctuation. This research is a preliminary attempt at data-driven modeling of prosodic phrase boundary prediction for the Tamil language. A speech synthesis system shows its naturalness when the corpus is annotated with prosodic information. Prosody modelling is subdivided into modelling the following constituents of prosody: phrasing, duration, intonation and intensity. A well-designed speech corpus, annotated with various levels of prosodic information, is used.

The corpus is analyzed automatically to create a prosodic model, which is then made to synthesize a training data set, following which the test data set is evaluated. Based on the performance on the test data, the models are then improved. Syllables carry sufficient duration information, which improves the quality of synthetic speech when they are used in a duration model. Thus syllables are identified as the best-suited processing units for Tamil speech synthesis, and diphones are identified as the best-suited processing units for ending a sentence naturally.

Fig: A General Text to speech approach

5. Corpus entry for syllable and diphones:

Syllable selection is a major issue that has to be dealt with when selecting the text corpus. The most frequent and unique syllables are covered by an optimized prompt list, which was selected for recording the speech corpus for Tamil. A rule-based parser was used to generate syllables from UTF-8 text. The speech data were labelled at syllable boundaries in two ways: manually, and using Ergodic Hidden Markov Models [6]. The first method resulted in a number of labelling errors, while the latter required a large amount of training data to mark syllable boundaries accurately. To resolve these issues, the speech data is labelled using a semi-automatic labelling tool developed at IITM, based on segmentation and identification of syllable units using the group delay function and the vowel onset point. A separate database has to be maintained for diphones. Diphones are relatively bigger units than phones. According to the available data [10], there are about 1000 to 2000 diphones in the Tamil language. Unlike phones, they do not exhibit allophonic variations, i.e., each diphone has only one instance of pronunciation. To limit the size of the database, diphone entries are used only for the final words of sentences, to end the speech naturally, and to pronounce punctuation, scientific notations, email addresses and website links naturally and descriptively.

6. Proposed system:

The unit selection paradigm uses the recorded speech corpus to extract the syllable units used to synthesize speech. With many realizations of the same syllable being present, clustering techniques are used to select the areas where diphones are to be implemented. An acoustic join cost is used as a measure to select an optimal path through these candidate sets. If the speech in the above-mentioned areas is processed by diphones, it will be natural and descriptive.
Identity of neighbouring units.
Position of syllable in the word.
Syllable at a phrase boundary.
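The selection of an optimal path through the candidate sets by means of a join cost can be sketched as a dynamic programming search. This is a minimal illustration under stated assumptions, not the implementation used in the system: each candidate unit is reduced to a hypothetical pair of scalar acoustic values at its left and right edges, and the join cost is taken to be their absolute difference.

```python
from typing import List, Sequence, Tuple

# Assumption: a unit is (unit id, acoustic value at left edge, value at right edge).
Unit = Tuple[str, float, float]


def join_cost(left: Unit, right: Unit) -> float:
    """Hypothetical join cost: mismatch between the two touching edges."""
    return abs(left[2] - right[1])


def best_path(candidate_sets: Sequence[Sequence[Unit]]) -> Tuple[float, List[str]]:
    """Pick one unit per target position so that the summed join cost is minimal."""
    if not candidate_sets or any(len(c) == 0 for c in candidate_sets):
        raise ValueError("every target position needs at least one candidate")

    # costs[j]: cheapest total cost of any path ending in candidate j of the current set
    costs = [0.0] * len(candidate_sets[0])
    back: List[List[int]] = []  # back[i][j]: best predecessor of candidate j in set i + 1

    for prev, cur in zip(candidate_sets, candidate_sets[1:]):
        new_costs, pointers = [], []
        for unit in cur:
            options = [costs[k] + join_cost(p, unit) for k, p in enumerate(prev)]
            k_best = min(range(len(options)), key=options.__getitem__)
            new_costs.append(options[k_best])
            pointers.append(k_best)
        costs = new_costs
        back.append(pointers)

    # Trace the cheapest path back from the last position to the first.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidate_sets[i][j][0] for i, j in enumerate(path)]
```

A real join cost would compare spectral and prosodic features at the unit boundaries; the scalar edge values are used here only to keep the search itself visible.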
7. Syllable and diphone clustering:
The existing clustering process [3] was unable to capture the gross acoustic properties of the units using the default feature set. A better linguistic feature vector set has to be used to capture the complex acoustic properties of a syllable and a diphone. On analyzing the speech corpora, it was seen that phrase boundaries play a major role in fluent connected speech. Tamil text generally lacks punctuation. Phrase boundaries and intra-phrase prosodic patterns [4] help in understanding an utterance. To overcome this issue, a separate corpus database was used to process diphone units.

Fig: Proposed system architecture

Initially, the text should be morphologically divided into valid tokens. This helps to identify the syllable occurrences and the sentence completion before processing. The tokens are separated and entered in the appropriate fields of the symbol table. The database has been designed with two separate blocks, one for storing syllables and the other for storing diphones.
The first syllable occurrence in every sentence is monitored and reported. The syllables of the first (N-1) words are taken and stored in the syllable database. The Nth word is processed by the diphone methodology and its units are stored in the diphone database. Selecting the phonetically and prosodically best units for synthesis requires clustering the units from both databases. An acoustic distance measure is defined to measure the distance between two units of the same type. Factors concerning prosodic and phonetic context are evaluated to form cluster units within a unit type. A decision tree is built based on questions concerning the phonetic and prosodic aspects of the grapheme. Ultimately, the leaves of the decision tree are the lists of database units that best suit the required aspects. At the time of synthesis, for each target, the appropriate decision tree is used to find the best cluster of candidate units. A search is then made to find the best path through the candidate units.
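The division of a sentence between the two databases can be sketched as a simple routing step. The code below is an illustration under assumptions, not the tokenizer of the proposed system: whitespace tokenization and the regular expressions for email addresses, website links and numeric notations are placeholders chosen here.

```python
import re
from typing import List, Tuple

# Assumption: rough stand-ins for the special token classes named in the text.
SPECIAL_TOKEN = re.compile(
    r"""
      [\w.+-]+@[\w-]+(?:\.[\w-]+)+             # email address
    | (?:https?://|www\.)\S+                   # website link
    | [+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?      # digits or scientific notation
    """,
    re.VERBOSE,
)


def route_tokens(sentence: str) -> List[Tuple[str, str]]:
    """Label each token with the database that should supply its units.

    The first N-1 words go to the syllable database; the Nth (final) word and
    any special token go to the diphone database.
    """
    tokens = sentence.split()
    last = len(tokens) - 1
    routed = []
    for i, token in enumerate(tokens):
        core = token.rstrip(".,!?")  # ignore trailing punctuation when classifying
        if i == last or SPECIAL_TOKEN.fullmatch(core):
            routed.append((token, "diphone"))
        else:
            routed.append((token, "syllable"))
    return routed
```

Each (token, database) pair would then be passed to the corresponding unit lookup before the selected units are concatenated.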
Pruning is performed to remove spurious, atypical units, which may have been caused by mislabelling or poor articulation in the original recording. It also removes those units which are so common that there is no significant distinction between candidates. The databases were tested with this clustering method. The method produced both extremely high quality examples and extremely low quality ones. Minimizing the bad examples was the main target.
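One way to picture the two pruning steps is shown below. It is a sketch under assumptions rather than the procedure actually used: units are represented by hypothetical feature vectors, atypical units are those lying further than `max_distance` from the cluster mean, and over-common units are handled by keeping at most `max_size` units per cluster.

```python
from math import dist
from statistics import fmean
from typing import Dict, List, Sequence


def prune_cluster(
    units: Dict[str, Sequence[float]],
    max_distance: float,
    max_size: int,
) -> List[str]:
    """Return the ids of the units kept in one cluster, most typical first."""
    if not units:
        return []
    # Cluster centre: the mean of every feature dimension.
    centre = [fmean(dimension) for dimension in zip(*units.values())]
    ranked = sorted(units, key=lambda uid: dist(units[uid], centre))
    # Step 1: drop atypical units (possible mislabelling or poor articulation).
    kept = [uid for uid in ranked if dist(units[uid], centre) <= max_distance]
    # Step 2: cap the cluster size, since extra near-identical units add little.
    return kept[:max_size]
```

Both thresholds are tuning parameters; the values that reduce the number of low quality examples would have to be found by listening tests on the actual databases.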
Clustering includes pre-clustering work that tags the syllables as begin, middle or end, depending on where the syllable occurs in the word. Further clustering is performed by tagging the syllables based on the type of the syllable (V, C*V, VC*, C*VC*) and the nature of the constituent vowels and consonants. Syllables of the same type were clustered using features such as the word length of the phrase, the relative position of the syllable in the phrase, the relative position of the parent phrase, and the features of the preceding and following syllables in the phrase. Using the feature set and the acoustic distance measure, a decision tree was built for each unique syllable in the database.
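The pre-clustering tags can be sketched as follows. This is an illustrative example, not the tagging code of the system: syllables are assumed to be given as strings of C and V labels, and a word with a single syllable is assumed to be tagged begin, a case the description above leaves open.

```python
from typing import List, Tuple


def position_tag(index: int, n_syllables: int) -> str:
    """Tag a syllable by where it occurs in its word."""
    if index == 0:
        return "begin"  # assumption: also used for one-syllable words
    if index == n_syllables - 1:
        return "end"
    return "middle"


def syllable_type(cv: str) -> str:
    """Classify a C/V label string as V, C*V, VC* or C*VC*."""
    if set(cv) - {"C", "V"} or cv.count("V") != 1:
        raise ValueError(f"not a C*VC* syllable: {cv!r}")
    nucleus = cv.index("V")
    has_leading = nucleus > 0              # consonants before the vowel
    has_trailing = nucleus < len(cv) - 1   # consonants after the vowel
    if has_leading and has_trailing:
        return "C*VC*"
    if has_leading:
        return "C*V"
    if has_trailing:
        return "VC*"
    return "V"


def tag_word(syllables: List[str]) -> List[Tuple[str, str, str]]:
    """Return (syllable, position tag, type tag) for every syllable of a word."""
    n = len(syllables)
    return [(s, position_tag(i, n), syllable_type(s)) for i, s in enumerate(syllables)]
```

The resulting tags can serve as the first split of the database, before the context features and the acoustic distance measure are applied within each group.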
The end words, by contrast, are taken from the diphone database so that the sentence ends naturally and descriptively when read. The diphone units also render punctuation clearly when reading digits, email addresses and scientific notations. Questions were used at the nodes of the decision tree to find the best set of candidate syllables. Morpheme tags are used for phrase prediction. This technique has improved the quality of the synthesized speech to a great extent.
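Since a diphone runs from the middle of one phone to the middle of the next, the diphone units needed for a sentence-final word follow directly from its phone sequence. The sketch below illustrates this; the romanized phone symbols and the "_" symbol for the silence around the word are assumptions made for the example.

```python
from typing import List, Tuple

SILENCE = "_"  # assumed symbol for the pause before and after the word


def diphones(phones: List[str], pad: bool = True) -> List[Tuple[str, str]]:
    """List the phone-to-phone transitions (diphones) of a phone sequence.

    With pad=True the transitions from and into silence are included, which
    is what a sentence-final word needs in order to end cleanly.
    """
    sequence = [SILENCE, *phones, SILENCE] if pad else list(phones)
    return list(zip(sequence, sequence[1:]))
```

For the phones k, a, l this yields the transitions _-k, k-a, a-l and l-_, each of which would be looked up in the diphone database.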
8. Conclusions:
In this paper, the issues in developing a speech corpus for the Tamil language are analyzed and surveyed. The problem of ending speech naturally can be resolved with a new approach that incorporates both diphone and syllable units. The various methodologies used to develop a speech corpus for Tamil show the need for prosody modelling to obtain high quality, intelligible speech. The appropriate speech unit for Tamil is the syllable, with the diphone used for natural endings; together they carry the prosodic features of the language well. Selecting the appropriate speech unit and annotating it with the necessary prosodic information play a vital role in the development of speech synthesis.
References:
[1] M. Nageshwara Rao, Samuel Thomas, T. Nagarajan and Hema A. Murthy, Text-to-speech synthesis using syllable-like units, in National Conference on Communication, Kharagpur, India, Jan 2005, pp. 277-280.
[2] R. Thangarajan, A.M. Natarajan and M. Selvam, Word and Triphone Based Approaches in Continuous Speech Recognition for Tamil Language, in WSEAS Transactions on Signal Processing, Issue 3, Volume 4, March 2008.
[3] Alan W. Black and P. Taylor, Automatically clustering similar units for unit selection in speech synthesis, Proc. EUROSPEECH 97, Rhodes, Greece, 1997, Vol. 2.
[4] Youngim Jung, Hyuk-Chul Kwon, Consistency Maintenance in Prosodic Labeling for Reliable Prediction of Prosodic Breaks, in the Proceedings of the Fifth Law Workshop (LAW V), Portland, Oregon, 23-24 June 2011.
[5] Vinodh M Vishwanath, Ashwin Bellur, Badri Narayan K, Deepali M Thakare, Anila Susan, Suthakar N M and Hema A Murthy, Using Polysyllabic units for Text to Speech Synthesis in Indian languages, Proceedings of National Conference on Communication (NCC), pp. 1-5, 29-31 Jan. 2010.
[6] Kishore Prahallad, Arthur R Toth, Alan W Black, Automatic Building of Synthetic Voices from Large Multi-Paragraph Speech Databases, in Proceedings of Interspeech, Antwerp, Belgium, 2007.
[7] Anumanchipalli Gopalakrishna, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Singh, R.N.V Sitaram and S.P. Kishore, Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems, Proceedings of International Conference on Speech and Computer (SPECOM), Patras, Greece, Oct 2005.
[8] T. Jayasankar, R. Thangarajan, J. Arputha Vijaya Selvi, Automatic Continuous Speech Segmentation to Improve Tamil Text-to-Speech Synthesis, in International Journal of Computer Applications (0975-8887), Volume 25 No. 1, July 2011.
[9] S. Saraswathi and T.V. Geetha, Design of language models at various phases of Tamil speech recognition system, International Journal of Engineering, Science and Technology, Vol. 2, No. 5, 2010, pp. 244-257.
[10] Kiruthiga S, Krishnamoorthy K, Design Issues in Developing Speech Corpus for Indian Languages, 2nd International Conference on Computer Communication and Informatics, Jan 2012, Vol. 2, 978-1-4577-1581-5.
[11] Samuel Thomas, M. Nageshwara Rao, Hema A. Murthy and C.S. Ramalingam, Natural sounding TTS based on syllable-like units, in the Proceedings of the 14th European Signal Processing Conference, Florence, Italy, Sep 2006.
[12] G. L. Jayavardhana Rama, A. G. Ramakrishnan, R. Muralishankar and R Prathibha, A Complete Text-To-Speech Synthesis System In Tamil, in 0-7803-7395-2/02, IEEE proceedings 2002.
[13] Ashwin Bellur, K Badri Narayan, Raghava Krishnan K, Hema A Murthy, Prosody Modeling for Syllable-Based Concatenative Speech Synthesis of Hindi and Tamil, DOI:10.1109/NCC.2011.5734737, IEEE proceedings 2011.

Kiruthiga S is an Assistant Professor at King College of Technology, Namakkal. She obtained her Master of Engineering degree in 2006 from Anna University, Chennai. She completed her Bachelor of Engineering degree at Bharathiyar University in 2003. She worked as a Lecturer at Sona College of Technology, Salem, for 4 years. Her research interests include Text to Speech Synthesis systems and other Natural Language Processing applications.

Krishnamoorthy K is a Professor at Sudharsan Engineering College, Pudukkottai. He is one of the recognized research supervisors at Anna University. He obtained his Ph.D. degree in 2007 and his Master of Engineering degree in 2003, both from Dayananda Sagar University, Bangalore. He has a decade of experience as an academician. His research interests include Natural Language Processing and Digital Image Processing.
