Development of a Syllabicator for Yorùbá Language

Kumolalo F. O.,1 Adagunodo E. R. and Odejobi O. A.

1 sobusola@oauife.edu.ng

Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria

Abstract
Few African languages have resources for natural language processing. Although the Yorùbá language has a speaker base of
over 30 million people throughout the world, it is still numbered among the under-resourced languages of the world, owing to
the lack of tools and resources for automatic processing of the language. This work identifies one such resource needed for
Yorùbá as a tonal language: a tool that takes a word and returns the syllables of which the word is composed. The approach
used is the rule-based paradigm. When evaluated on a text of 90,622 words, the system returned correct syllables from a given
word with 99.993% accuracy.

Keywords: Syllabification, rule-based, resources, tokens, words

Introduction
The Yorùbá language is classified as part of the Yoruboid subgroup of the Niger-Congo group of languages [1]. It has an
estimated first-language speaker base of over 30 million [2] spread across Nigeria (where it is an official language); it is also
natively spoken by over a million people in the Benin Republic [3] and Togo, and by emigrants to the United Kingdom, the
United States, Canada and other European and Asian countries. It is used in other forms in Liberia and Sierra Leone.
Diminutive versions of it, with names such as Nago and Lucumi, are used in religious ceremonies in Brazil, Cuba, and Trinidad
and Tobago, amongst others [4]. Although there are various dialects of Yorùbá, the one used for official correspondence,
taught in schools and used in news and television broadcasts is referred to as Standard Yorùbá [5].

Development of a writing system for the language started in the mid-19th century. This was done by missionaries, notable
among whom were Kilham and Bishop Ajayi Crowther. By 1849 the first Yorùbá grammar and dictionary had been
produced [6]. Nevertheless, Yorùbá is classified as a resource-scarce language [7]. "Resource-scarce languages are languages
for which few digital resources exist; and thus, languages whose computerization poses unique challenges" [8]. According
to [8], these challenges include the unavailability of electronic lexica, standardized electronic corpora, and natural language
processing tools such as Part-of-Speech (POS) taggers and chunkers. They also include the lack of the political and financial
strength to make developers and decision-makers in the computer industry attend to the needs of the language.

The field, however, is not devoid of activity toward language technology research and products, such as [5,6,7,8,9,10].
More resources are still needed to meet the natural language processing needs of this language. In this work, we focus on
the development of a tokenizer that takes a stream of text and breaks it into syllable tokens. The rest of this paper is
organized as follows: tokenizers, their functions and a review of related work; possible approaches to developing a syllable
tokenizer, henceforth referred to as a syllabicator, for Yorùbá; and the implementation of the syllabicator. The paper
concludes with a performance evaluation of the syllabicator.

Tokenizers and Approaches to Construction

Tokenizers are common preprocessing tools in natural language processing, and they come in various kinds. The most
common are those that take a stream of text and return the words in the text as tokens; this is the most common sense in
which the word tokenizer is used. In this work, we focus on the development of a syllabicator that takes a stream of text and
breaks it into syllable tokens. The need for a tokenizer arises because most "language processing tasks are formulated as
annotations and transformations involving tokens" [11]. It is these constituent parts that are referred to as tokens. Natural
languages consist of characters: alphabets or letters, digits, symbols and marks. These characters combine in varying
proportions according to the rules or grammar governing the language. Tokenizing is therefore the separation and grouping
of the characters found in a text according to a defined pattern for a particular purpose. It must be noted that tokenizing does
not transform the data. The most common tokens are word tokens. Tokenizing into word tokens is simple if the language
uses a script in which consecutive words are separated by spaces. Otherwise it becomes a problem of determining word
boundaries, as for example in Chinese script. Another version of the problem arises where a word can consist of more than
one space-separated unit, as in Vietnamese when written with the Latin script. An example of an application that might
require word-level tokenization is Part-of-Speech (POS) tagging.
However, other units of language exist. One option is tokenizing at the character level; the complexity of this does not vary
from language to language. There is also sentence-level tokenization, where the task is to determine sentence boundaries
within a text document. Under a broad interpretation of the term, chunking can also be classified as tokenizing. Text can also
be tokenized into phonemes (if phonotypy is used) and syllables.

Review of Related Previous Works

Hu [12] developed a Text Statistics Toolbox that attempted, among other tasks, to count the number of syllables in a
character list (that is, a sentence) using the English syllable rules listed in [13] and others. The rules used in constructing
this toolbox made accuracy difficult, as the system can easily misclassify syllables. Another set of efforts at counting the
syllables in English sentences is shown in [14]. None of these is guaranteed to give very good results, and none was
designed with the intent of producing tokens.
The OAK System was developed by the Proteus Project [15] of the Department of Computer Science of New York
University as a "total analyzer for English". Its tokenizer tokenizes into words or wordforms, but not into syllables. The
tokenizer in the Stanford Natural Language Processing (NLP) Toolbox of the Stanford NLP Group [16] also tokenizes into
words and wordforms; the English tokenizer is a finite-state machine based on hand-written rules. The OpenNLP tools are a
set of Java-based NLP tools that perform sentence detection and tokenization, among other tasks, for two European
languages and one Asian language [17]; the tokenizer, which tokenizes sentences into words, follows the token style of the
English Penn Treebank. The NL Tokenizer [18] for English tokenizes a text into paragraphs, sentences and words (including
simple wordforms, abbreviations, numbers, etc.) and can work in standalone mode or as part of another module. [19]
reported the performance of thirteen tokenizers, but all of them return words and none returns syllables.
Ngugi et al. [20] is the only work closely related to this one. They developed a Swahili text-to-speech application with an
embedded syllabification module (SM). The SM takes in a text consisting of strings of Swahili words and outputs a string of
syllables to be fed as input into a digital signal processor that synthesizes the speech. [20] also used a rule-set, though a
simpler one, owing to the simpler syllable structure of Swahili.

Design of the Yorùbá syllabicator

Possible approaches to accomplishing this task include statistical and rule-based approaches.
A statistical approach can rely either on a pure statistical algorithm or on an example-based method. The challenge this task
would pose for such an approach is the requirement of an existing and extensive corpus or word list. An example-based
approach would also depend on the word already existing within the corpus of text used for prediction. Moreover, the
syllabification of a word is deterministic, not probabilistic, so a probabilistic approach will likely perform worse than a
deterministic one.
A declarative rule-based approach is therefore proposed for the syllabification of Yorùbá words. In a declarative rule-based
approach, exactly one rule fires deterministically while the others are ignored, as they do not apply to the given situation.
The methodology used in this work is first to determine the length of the word and thereafter apply a set of rules that takes
into consideration the characters in the word and their order or arrangement within the word.

Model of Yoruba Syllables and Syllabification Rule-set Model

Model of Yoruba Syllables

The Yorùbá language has three syllable structures: vowel, consonant-vowel and syllabic nasal. To expatiate: there are seven
basic Yorùbá vowels, denoted V in Figure 1; five nasal vowels, denoted Vn; and two syllabic nasals, denoted N. The total
number of Yorùbá consonants is 18; one of them is a digraph, denoted D, and the rest are denoted C in Figure 1. This allows
for nucleus-only syllables and onset-plus-nucleus syllables.
Using these facts, we distinguish four possible word lengths: greater than three, equal to three, equal to two and equal to
one. The choice of these word lengths arises from the maximum and minimum numbers of characters that can form a
syllable in Yorùbá: DVn syllables have four characters; CVn and DV syllables have three characters each; CV and Vn
syllables have two characters each; and V and N syllables have only one character each.

Figure 1: Yoruba Syllable Structure.
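The seven syllable shapes above can be expressed as a small classifier. The sketch below is an illustration in Python, not the authors' Java implementation; the character sets are our assumptions for Standard Yorùbá lowercase text without tone diacritics, with "gb" as the digraph D and m/n as the syllabic nasals:

```python
ORAL_VOWELS = set("aeiou") | {"ẹ", "ọ"}        # the 7 basic vowels (V)
NASAL_VOWELS = {"an", "ẹn", "in", "ọn", "un"}  # the 5 nasal vowels (Vn)
SYLLABIC_NASALS = {"m", "n"}                   # the 2 syllabic nasals (N)
CONSONANTS = set("bdfghjklmnprstwy") | {"ṣ"}   # 17 simple consonants (C)
DIGRAPH = "gb"                                 # the 18th consonant (D)

def syllable_type(s):
    """Classify a candidate syllable as one of V, N, Vn, CV, CVn, DV, DVn."""
    if len(s) == 1:
        if s in ORAL_VOWELS:
            return "V"
        if s in SYLLABIC_NASALS:
            return "N"
    elif len(s) == 2:
        if s in NASAL_VOWELS:
            return "Vn"
        if s[0] in CONSONANTS and s[1] in ORAL_VOWELS:
            return "CV"
    elif len(s) == 3:
        if s[0] in CONSONANTS and s[1:] in NASAL_VOWELS:
            return "CVn"
        if s[:2] == DIGRAPH and s[2] in ORAL_VOWELS:
            return "DV"
    elif len(s) == 4 and s[:2] == DIGRAPH and s[2:] in NASAL_VOWELS:
        return "DVn"
    return None  # not a well-formed Yorùbá syllable
```

The character counts match the paper: DVn is the four-character maximum, while V and N are the one-character minimum.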

Syllabification Rule-set Model


The syllabicator consists of the following modules: a file-handling module, a multiple-diacritic handling module, a
word-tokenizer module, a word-language separator module and the syllabication module. The file is read in, the necessary
diacritic transformations are performed, and the text is tokenized into words. Each word is then parsed to determine whether
its structure is Yorùbá before finally being passed to the syllabication module. What is described here is the process of
syllabication.
For each of the given word lengths, a set of rules was designed to cover the possible combinations of characters and the
syllables that could be formed from them. The word length greater than three is labelled "case 4", with 8 rules for
determining the syllable boundary and extracting the syllable; the word length equal to three is labelled "case 3", with
10 rules for syllable boundary determination and extraction. "Case 2" is for a word length of two and has four rules, while
the last, "case 1", has only one rule.
The rule-set is arranged as a cascaded body of rules, such that "case 4" can extract syllables of length four down to one,
"case 3" can extract syllables of length three down to one, and so on. Once a syllable is extracted, the length of the
remaining string is recalculated and the process is repeated until the length is zero. If no syllable can be extracted from a
word or string of characters of a given word length, the syllabification is reported as failed. Figure 2 shows a diagrammatic
representation of the processing in the syllabication module. Failure to syllabify can arise for either of two reasons: a
character is present in the string that is not defined in the Yorùbá alphabet; or, even if all the characters are defined in the
Yorùbá alphabet, a sequence exists in the string that is not defined for the Yorùbá word pattern.
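The cascaded, longest-pattern-first extraction described above can be sketched as follows. This is a simplified Python reconstruction, not the authors' Java rule-set: it assumes lowercase input without tone diacritics, uses our own character sets, and treats a vowel followed by n as a nasal vowel only when the n does not itself begin the next syllable (i.e. is not followed by a vowel). It returns None on failure, mirroring the two failure conditions above:

```python
ORAL_VOWELS = set("aeiou") | {"ẹ", "ọ"}        # basic vowels
NASAL_VOWELS = {"an", "ẹn", "in", "ọn", "un"}  # nasal vowels (vowel + n)
SYLLABIC_NASALS = {"m", "n"}
CONSONANTS = set("bdfghjklmnprstwy") | {"ṣ"}   # simple consonants; "gb" is the digraph

def nucleus(s):
    """Longest vowel nucleus at the start of s: a nasal vowel (Vn) when the
    trailing n is not the onset of the next syllable, else a plain vowel."""
    if not s or s[0] not in ORAL_VOWELS:
        return None
    if s[:2] in NASAL_VOWELS and (len(s) == 2 or s[2] not in ORAL_VOWELS):
        return s[:2]
    return s[0]

def next_syllable(s):
    # Cascade from the longest pattern (DVn, 4 chars) down to the shortest (V/N, 1 char).
    if s.startswith("gb"):                     # digraph onset: DV or DVn
        nuc = nucleus(s[2:])
        return "gb" + nuc if nuc else None
    if s[0] in CONSONANTS:                     # simple onset: CV or CVn
        nuc = nucleus(s[1:])
        if nuc:
            return s[0] + nuc
    if s[0] in SYLLABIC_NASALS:                # m/n not followed by a vowel: N
        return s[0]
    if s[0] in ORAL_VOWELS:                    # bare nucleus: V or Vn
        return nucleus(s)
    return None                                # character not defined for Yorùbá

def syllabify(word):
    syllables = []
    while word:
        syl = next_syllable(word)
        if syl is None:
            return None                        # syllabification reported as failed
        syllables.append(syl)
        word = word[len(syl):]                 # recalculate the remaining string
    return syllables
```

For example, `syllabify("gbogbo")` gives `["gbo", "gbo"]` and `syllabify("nkan")` gives `["n", "kan"]`, while a string containing an undefined character or sequence yields None.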

Figure 2: Syllabification Rule-set Process

Implementation and Performance Evaluation
The syllabicator was implemented in the Java TM Software Development Kit 6. Java was chosen as the programming
language for its platform independence and extensive Unicode facilities. The prototype is currently command-line interface
(CLI) based. It takes as input a text file with UTF-16LE encoding and a ".txt" extension; UTF-16LE encoding was adopted
to enable the file to correctly represent non-ASCII characters.
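Reading a UTF-16LE-encoded text file is straightforward in most languages. A Python equivalent of this input step (a sketch only; the authors' Java file-handling module is not shown in the paper) would be:

```python
def read_yoruba_text(path):
    # "utf-16-le" decodes little-endian UTF-16 without requiring a byte-order
    # mark, so non-ASCII Yorùbá characters such as ẹ, ọ and ṣ decode correctly.
    with open(path, encoding="utf-16-le") as f:
        return f.read()
```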
The syllabicator component was fine-tuned successfully after testing with 12 text files of different sizes. Table 1 below
shows the files with their statistics. The syllable counts were derived from the syllable files produced by the syllabification
suite.

Table 1: Yorùbá text files with word counts and generated syllable counts
Documents Word count Syllable count
Document 1 426 704
Document 2 427 748
Document 3 521 895
Document 4 702 1291
Document 5 839 1565
Document 6 1052 2522
Document 7 1721 3216
Document 8 1884 3193
Document 9 2493 4908
Document 10 3421 6421
Document 11 4027 6709
Document 12 6087 12918

Conclusion
This work presented an effort in contribution to the toolkit for natural language processing of Yorùbá language which is one
of the many under-resourced languages. The approach used was rule-based due to the fact of insufficient resources for using
supervised and unsupervised learning approaches. The recorded performance of the system was 99.99%. The error rate was
measured as the percentage of words wrongly syllabicated or not syllabicated (6 out of 90622). This tool is presently targeted
to be used as a pre-processing tool for diacritic restoration. We expect that it will find used in speech processing technology
for Yorùbá as well as other pre-processing activities.

Biography of Authors.

Kumolalo F. O. (MSc, OAU) is a lecturer in the Department of Computer Science and Engineering, Obafemi Awolowo
University, Ile-Ife, Nigeria. He is currently pursuing PhD research focused on natural language processing. He has
publications in international and local journals.
Adagunodo E. R. (PhD, OAU) is a Professor of Computer Science in the Department of Computer Science and Engineering,
Obafemi Awolowo University, Ile-Ife, Nigeria. He has published in both international and local journals. He is currently the
Head of the Department and has supervised several postgraduate students.
Odejobi O. A. (PhD, Aston) specializes in Artificial Intelligence and Computational Linguistics. He is an international
scholar who has publications in international and national journals. He is a Senior Lecturer in the Department of Computer
Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria.

References
[1] Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International.
Online version: http://www.ethnologue.com/. Accessed February 12, 2010
[2] Federal Republic of Nigeria. (2007). Official Gazette of Federal Republic of Nigeria No 4 Volume 94, Annexure A.
Retrieved from http://www.population.gov.ng/pop_figure.pdf on 08/03/2010
[3] The Central Intelligence Agency (2009). CIA - The World Factbook - Benin. Retrieved from
https://www.cia.gov/library/publications/the-world-factbook/geos/bn.html on 08/03/2010
[4] Awoyale Y. (2008). Global Yoruba Lexical Database v. 1.0 Linguistic Data Consortium, Philadelphia. Retrieved on
10/06/2008 from http://www.ldc.upenn.edu/Catalog/docs/LDC2008L03/Global_Yoruba_Lexical_Database.pdf

[5] Odejobi, O. A. (2005). A Computational Model of Prosody for Yorùbá Text-to-Speech Synthesis. Unpublished PhD
thesis, Aston University.
[6] African Studies Institute (2010). Yorùbá Online. African Studies Institute, University of Georgia. Retrieved April 24,
2010, from http://www.africa.uga.edu/Yoruba/yorubabout.html
[7] De Pauw, G., Wagacha, P. W. & de Schryver, G.-M. (2007). Automatic Diacritic Restoration for Resource-Scarce
Languages. In V. Matoušek & P. Mautner (eds.), Text, Speech and Dialogue, 10th International Conference, TSD 2007,
Pilsen, Czech Republic, September 3-7, 2007, Proceedings (Lecture Notes in Artificial Intelligence (LNAI), subseries of
Lecture Notes in Computer Science (LNCS), Vol. 4629), pp. 170-179. Berlin: Springer-Verlag.
[8] Areces, C. (2006, January 11). [PVS] CFP: Resource-Scarce Language Engineering. Email archive retrieved from
http://pvs.csl.sri.com/mail-archive/pvs/msg02605.html
[9] Abeokuta.org (2010). Yorùbá gbode: Computer typing in Yorùbá language on Facebook.
http://www.abeokuta.org/yoruba/?page_id=235 accessed January 13, 2010
[10] Scannell, K. P. (2010). Statistical Unicodification of African Languages. Submitted for presentation at LREC 2010
Retrieved on February 14, 2010 at http://borel.slu.edu/pub/lre.pdf
[11] Bird S., Klein E. and Loper E.(2006) Introduction to Natural Language Processing (DRAFT). University of
Pennsylvania.
[12] Hu, Cheng (2003). Text Statistics ToolBox for Natural Language Processing. Retrieved on May 24, 2010 from
http://www.ai.uga.edu/mc/pronto/Hu.pdf
[13] Doyle C. (2003) Phonics syllables and accent rules. Retrieved on May 24, 2010 from
http://english.glendale.cc.ca.us/phonic.rule.html
[14] StackOverflow (2009) Ruby, Count Syllables. Retrieved on May 13, 2010 from
http://stackoverflow.com/questions/1271918/ruby-count-syllables
[15] Proteus Project (2004) OAK System (English Sentence Analyzer). Department of Computer Science, New York
University, NY. Retrieved on May 24, 2010 from http://nlp.cs.nyu.edu/oak
[16] Stanford University Natural Language Processing Group(2009) The Stanford Natural Language Processing FAQ.
Retrieved on May 25, 2010 from http://nlp.stanford.edu/software/parser-faq.shtml
[17] OpenNLP (2010) The OpenNLP. Retrieved on May 24, 2010 from http://opennlp.sourceforge.net/
[18] Lager, Torbjörn (n.d.). NL Tokenizer. Retrieved on May 25, 2010 from
http://www.ling.gu.se/~lager/Oz/FlexTokenizer/index.html
[19] He Y., Kayaalp M. (2006) A Comparison of 13 Tokenizers on MEDLINE. Technical Report LHNCBC-TR-2006-003.
Retrieved on August 09, 2010 from www.lhncbc.nlm.nih.gov/lhc/docs/reports/2006/tr2006003.pdf
[20] Ngugi K., Okelo-Odongo W., Wagacha P. W.( 2005) Swahili Text-to-Speech System. African Journal of Science and
Technology (AJST) Science and Engineering Series Vol. 6, No. 1, pp. 80 – 89.
