
International Journal of Advanced Computer Science, Vol. 3, No. 8, Pp. 416-427, Aug., 2013.

Phonologic and Syllabic Patterns of Brazilian Portuguese Extracted from a G2P Decoder-Parser
Vera Vasilévski, Leonor Scliar-Cabral, & Márcio José Araújo
Manuscript
Received: 21 Apr. 2013; Revised: 26 May 2013; Accepted: 5 Jul. 2013; Published: 15 Jul. 2013

Keywords
Brazilian Portuguese, Structure, Patterns, Phonology, Syllable.

Abstract
This paper presents the distribution of Brazilian Portuguese phoneme patterns, obtained with an automatic, rule-based grapheme-to-phoneme converter. The software Nhenhém was used for data treatment: written texts were decoded into phonological symbols, forming a corpus that was subjected to statistical analysis. The results support the high level of predictability of the distribution of Brazilian Portuguese phonemes, with the consonant-vowel structure as the canonical syllable pattern and 'CV.CV as the predominant stress pattern. They also demonstrate the efficiency of a grapheme-to-phoneme converter based entirely on rules. These results are presented and discussed, together with some aspects of the Nhenhém converter and parser.

1. Introduction
The challenging problem of how alphabetic systems represent the phonology of a given language [1] is the issue discussed here. We illustrate it with empirical evidence based on a statistical analysis of the distribution of Brazilian Portuguese phonemes and syllable structures. Questions dealing with prosody are also addressed, with some comments on the spelling agreement signed in 2009 by seven countries where Portuguese is the official language. This agreement, whose goal is to standardize Portuguese spelling, will probably come into effect in 2016. The patterns presented were obtained with the software Nhenhém [2], an automatic, rule-based grapheme-to-phoneme converter, applied to written texts in Brazilian Portuguese. The program was also extended to work as a syllable parser. The presentation is preceded by a description of the relation between the Portuguese writing system and the phonological system, and of the main problems the programmers had to solve in order to write the algorithms. Some principles of the Portuguese spelling system, together with some of the theories that guided the construction of the converter, support the discussion. Nhenhém has supplied all transcriptions used in this article.

2. Spoken and Written Language

Science and History [1] state that oral verbal language develops spontaneously wherever traces of humanization are found, whereas written language is an invention that, in most cases, requires intensive and systematic learning [3]. Linguistic evolution is not just a matter of phonological and phonetic change; however, changes often start as modifications in pronunciation [1]. As a consequence, oppositions fade and disappear, producing homonyms, which must be avoided, so new words are introduced to prevent ambiguity of signs [4]. Languages are in perpetual change, although they appear to be at rest. The distance between the oral and the written systems keeps growing, because the latter is conservative and subject to literary traditions.
In alphabetic systems, one or more letters (graphemes) represent the phonemes. These units, which belong to the second articulation, distinguish meaning in writing, but the representation is not one-to-one, owing to the distance between the oral and the written systems already mentioned. Another divergent principle is also at work: the etymological one. Since many spellings are based on etymological origin [3], writing does not represent the oral system faithfully. Spoken and written language each have their own laws and ways.
A. Phonetics and Phonology
While Phonetics is concerned with describing speech sounds (phones) from the point of view of their articulation, perception and physical properties, Phonology studies the phonemes of a language, that is, classes of sounds abstractly represented in the minds of a linguistic community. Phonemic transcription is therefore broad (general), covering all possible phonetic variations of each phoneme. Phonology aims at deep invariance, while Phonetics investigates surface variation.
There are many schools of Phonology. The first was the Prague Circle, which introduced the functionalist approach, meaning, in this case, that only phonetic differences that cause differences of meaning are relevant. Perceiving those differences is a psychic process and implies disregarding any similar phonetic difference that does not produce a different meaning.
This work is supported by CAPES, an entity of the Brazilian government for the qualification of human resources. Vera Vasilévski, Federal University of Santa Catarina (UFSC) (seread@hotmail.com); Leonor Scliar-Cabral, UFSC (1sc@th.com); Márcio José Araújo, Federal Technological University of Paraná (UTFPR) (marcomjapr@gmail.com)

Vasilévski et al.: Phonologic and Syllabic Patterns of Brazilian Portuguese Extracted from a G2P Decoder-Parser.


Phonology abstracts away from the physical properties of sounds, which are the domain of Phonetics. In the terms of Glossematics, Phonetics studies the expression of sounds (the substance of sounds in their multiplicity and variation), and Phonology studies the form (relations, classes, and abstract nature, which takes place in the mind) [4]. Since alphabetic principles are based on the representation of phonemes, any automatic program must start from the phonological description of the respective language, in this case the Brazilian Portuguese phonological system, transcribed here.
B. Brazilian Portuguese spelling system
We present and discuss here some of the most important rules of the spelling system. Portuguese is a syllable-stressed language, i.e., the vast majority of Portuguese words have a stressed syllable, leaving aside clitics, which are few in number but are the most frequently used words (for instance, articles, most prepositions, some pronouns and conjunctions, and accusative pronouns). However, the stressed syllable is not marked graphically in the most frequent stressed words (those stressed on the penultimate syllable): following the principle of Occam's razor, only the stress of less frequent patterns is registered graphically. Marking stress graphically is a powerful cue for readers, because it guides them in matching the written word with its oral representation in the mental lexicon. The criteria for graphically marking stress in Portuguese words are the following: a) the syllable on which stress falls; b) the last vowel, followed or not by s; c) the last consonant; d) the difference between diphthong and hiatus. Details and examples are given below. The stress diacritics of Portuguese are the acute (chapéu, hat) and the circumflex (você, you).
A morphosyntactic diacritic (which does not signal stress) is used to mark the merging of the preposition a with the definite article or demonstrative pronoun a/as, or with the same vowel at the beginning of the demonstrative pronoun a/aquela(s), a/aquele(s). For instance, Fui à casa da Maria (I went to Mary's home), Vamos àquele lugar (Let's go to that place). In Portuguese, stress may fall on the last, penultimate, antepenultimate or, much more rarely, the fourth-to-last syllable of the phonological word, for example, núpcias (wedding) /nu.p.si.aS/ [5]. The phonological word in Portuguese is well defined, and its distinctive mark is stress [5]. Thus, the stress position clearly reveals the distinctive vowel [6]. The position of stress does not depend on the phonemic structure of the word. No word ending in Portuguese imposes a particular stress, but one termination is more frequent, although this frequency cannot be determined phonologically [6]. However, the characteristic Portuguese stress falls on the penultimate syllable, which gives Portuguese a bass rhythm. Nevertheless, Brazilian Portuguese has more words with
International Journal Publishers Group (IJPG)

stress on the last syllable than European Portuguese, because it incorporated words from the African and Indigenous languages spoken by those who lived alongside the Portuguese colonizers in the past [5]. Even so, the influence of Indigenous and African languages was only lexical, since no phoneme from them was borrowed into the Brazilian Portuguese phonemic system [25].
Another characteristic that makes the Portuguese system of marking the stressed syllable in writing effective is that it was guided by phonological intuition. The main stress of Portuguese words is registered graphically according to the frequency of the pattern in the language. The most frequent word pattern is 'C(C)V.C(C)V(s)#, where the last vowel must be a, e, or o. These words receive no written stress mark, e.g., mesa (table) /me.za/, escreves (you write) /iS.kr.viS/, livro (book) /li.vru/. The pattern 'C(C)V(s)# is the second most frequent; again, the last written vowel must be a, e, or o. If the last vowel is [-high, -low], it receives a circumflex, e.g., avô (grandfather) /a.vo/; if the last vowel is [+low], it receives an acute accent, e.g., sofá (sofa) /so.fa/, cafés (coffees) /ka.fS/, vovó (grandma) /vo.v/. On the other hand, if the last stressed vowel is i or u, as in abacaxi (pineapple) and caju (cashew), /a.ba.ka.i/ and /ka.u/, the word receives no diacritic.
In Brazil, in most sociolinguistic varieties, unstressed final vowels spelled e and o are neutralized in favor of /i/ and /u/, respectively, in pronunciation. This neutralization happens because, when the penultimate or antepenultimate syllable of the word carries the stress, the last syllable is reduced: gente (people) /.ti/, carro (car) /ka.u/. Words ending in descending diphthongs without any diacritic must be read with stress on the last syllable: plebeu (commoner) /ple.bew/, união (union) /u.ni.w/.
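The stress-marking conventions described so far can be read as a small decision procedure. The following Python sketch is only an illustration under stated assumptions, not the Nhenhém implementation: the function name `stressed_index`, the pre-syllabified input, the handling of a final s for every ending, and the list of nasal diphthong spellings are all hypothetical choices made for this example.

```python
import unicodedata

# Combining acute and circumflex: the two stress diacritics named in the text.
STRESS_MARKS = {"\u0301", "\u0302"}
# Spellings of nasal descending diphthongs (an assumption of this sketch).
NASAL_DIPHTHONGS = tuple(
    unicodedata.normalize("NFC", s) for s in ("ão", "ãe", "õe")
)


def has_stress_mark(syllable):
    """True if any letter of the syllable carries an acute or circumflex."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return any(ch in STRESS_MARKS for ch in decomposed)


def strip_marks(text):
    """Remove every combining mark (accents, tilde, cedilla)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stressed_index(syllables):
    """Return the index of the stressed syllable of one syllabified word."""
    # 1) A written acute or circumflex marks the stressed syllable directly.
    for i, syl in enumerate(syllables):
        if has_stress_mark(syl):
            return i
    if len(syllables) == 1:
        return 0
    last = unicodedata.normalize("NFC", syllables[-1]).lower()
    if last.endswith("s"):
        last = last[:-1]  # assumption: a final s never changes the rule
    # 2) Unmarked nasal descending diphthong: stress on the last syllable.
    if last.endswith(NASAL_DIPHTHONGS):
        return len(syllables) - 1
    plain = strip_marks(last)
    # 3) Final i or u (which also closes oral descending diphthongs).
    if plain.endswith(("i", "u")):
        return len(syllables) - 1
    # 4) Final a, e, o: the most frequent pattern, penultimate stress.
    if plain.endswith(("a", "e", "o")):
        return len(syllables) - 2
    raise ValueError("ending not covered by this sketch")
```

For example, `stressed_index(["me", "sa"])` selects the first syllable, while `stressed_index(["a", "ba", "ca", "xi"])` and `stressed_index(["u", "ni", "ão"])` select the last one. Words ending in other consonants are deliberately left out, because the rules for them are not spelled out in this section.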
If stress falls on the penultimate syllable in words ending in descending diphthongs, the stressed vowel is marked with a diacritic: pônei (pony) /po.ne /. Since stress on the antepenultimate syllable is the least frequent pattern, all Portuguese words stressed on that syllable have it marked graphically: número (number), cálida (warm, fem.), zênite (zenith) /nu.me.ru/, /ka.li.da/, /ze.ni.ti/. One example of the morphosyntactic function of a diacritic occurs with two verbs, ter (to have) and vir (to come), and their derivatives, in the third person plural of the present indicative (têm, vêm, contêm, provêm) [3]: the diacritic indicates plural, since the third person singular forms are tem, vem, contém, provém. The pronunciation, however, does not change, since singular and plural forms are homophones: vem, vêm /v/, /v/. In summary, the Portuguese written system of marking stress is based on the principle of economy (Occam's razor), considering that the most frequent pattern 'CV.CV(s) is


the one that does not receive a diacritic. It thus facilitates decoding, although it makes encoding more complicated, especially because it is not properly understood by teachers and, therefore, by students. This system has lost some of its qualities based on phonological intuition, owing to diachronic changes in the oral system and the lack of spelling rules reflecting those changes; the 1991 (2009) agreement made the situation worse. We will come back to this point.
C. The Portuguese syllable
The syllable is the higher unit in which phonemes (vowels and consonants) combine to allow enunciation [6]. Syllable division is studied in depth by Phonology, and the types of syllable structure characterize languages. The basic phonemic structure is the syllable, not the phoneme (Jakobson, 1967 apud [5]). The syllable in Portuguese can be understood as a set of positions (slope or onset, core or nucleus, and decline or coda) to be occupied by specific phonemes. The nucleus is the only obligatory position in Portuguese and must always be occupied by a vowel, which is the predominant phoneme of the syllable. The slope (onset) is occupied by one or more consonants and may be absent from the syllable. Further restrictions apply to the decline (coda), which in Brazilian Portuguese accepts only certain consonants and the semivowels /j/, /w/, and which can also be empty. A basic syllabic schema is displayed in Fig. 1. Although it lacks a few optionality marks, and so needs some refinement, it gives an idea of the phonological syllable structure of Brazilian Portuguese.
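The three syllable positions can be made concrete with a short Python sketch. It is a hedged illustration only: the function `split_syllable`, the token-list input, and the vowel inventory `VOWELS` (including the symbols chosen for open and nasal vowels) are assumptions of this example, and the sketch does not check which consonants are allowed in the coda.

```python
# Assumed vowel inventory for this sketch (symbols are illustrative).
VOWELS = {"a", "e", "i", "o", "u", "ɛ", "ɔ", "ã", "ẽ", "ĩ", "õ", "ũ"}


def split_syllable(phonemes):
    """Split one phonological syllable into (onset, nucleus, coda).

    `phonemes` is a list of phoneme symbols. The nucleus is the single
    vowel; everything before it is the onset, everything after it
    (consonants or the semivowels j, w) is the coda.
    """
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if len(nuclei) != 1:
        raise ValueError("a syllable needs exactly one vowel nucleus")
    n = nuclei[0]
    return list(phonemes[:n]), phonemes[n], list(phonemes[n + 1:])
```

With this representation, the syllable /klawS/ of claustrofobia would be passed as `["k", "l", "a", "w", "S"]`, giving the onset `["k", "l"]`, the nucleus `"a"` and the coda `["w", "S"]`; a syllable made of a single vowel has an empty onset and an empty coda.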

V vowel; C consonant; { } braces indicate that phonemes inside them may be combined with the respective phoneme on the left or on the right; ( ) parentheses indicate that phonemes inside them may occur in that position or not; | | archiphoneme; / / phoneme.
Fig. 1 Brazilian Portuguese syllabic-phonologic schema [22].

In Brazilian Portuguese, the so-called free or open syllables, which are the ones that end with a vowel, predominate. This kind of syllable includes simple syllables (V) and open complex ones (CV). Locked or closed syllables are those ending with consonants (VC, CV(C)C). They are much less frequent in Brazilian Portuguese, and there are severe constraints on which consonants may occupy this position [5]. The most complex syllables in Portuguese are the ones that end with two or three phonemes: CCVVC (claus.tro.fo.bi.a /klawS.tro.fo.bi.a/), CCVCC (trans.mu.ta.ção /traNS.mu.ta.sawN/ ~ /trS.mu.ta.sw/), and CVCCC (gangs.te.ris.mo /gaN.gS.te.riS.mu/ ~ /g.gS.te.riS.mu/). The CVCCC syllable is only orthographic, since it breaks into two phonological syllables, that is, CVC.CVC or CV.CVC. For the last two examples there are two possible phonological interpretations: the first assumes a nasal consonant in coda position and disregards the existence of nasal vowels, while the second assumes nasal vowels and the absence of a nasal consonant phoneme in coda position (what the second position admits is the existence of phonetic variants, or allophones, conditioned by the following consonant). One of the most important pieces of evidence in favor of the second position is the fact that a velar nasal consonant is produced whenever the following onset is a velar consonant, for instance, in the word canga. There is no velar nasal phoneme in Portuguese, nor is it possible to commute any of the so-called nasal consonants in internal coda position, in the same context (minimal pairs), so as to produce a change of meaning. Nhenhém's syllable parsing favors the second position.
The sequence CCCV is not valid for Brazilian Portuguese, although it is valid for European Portuguese [14]. The pronunciation of a foreign word like stress is [is.tr.si], so its written form in Brazil is estresse. In general, syllable boundaries in Portuguese are clear, but there are three cases in which they fluctuate.
There are three groups of vowel contexts in which an unstressed high vowel may be considered either a semivowel, belonging to a diphthong, or a vowel, forming a hiatus [6]: a) /i/ or /u/ preceded or followed by another unstressed vowel (variedade, saudade, cuidado); b) /i/ or /u/ followed by a stressed vowel (piano, viola); and c) /i/ or /u/ followed by an unstressed vowel at the end of the word (Índia, assíduo). Phonetically, these can be understood as diphthongs or hiatuses in free variation, with no distinctive opposition. Phonologically, however, there is a variable, non-significant syllable boundary. In Brazilian Portuguese, they are better understood as hiatuses (/va.ri.e.da.di/, /pi..nu/, /vi..la/, /.di.a/, /a.si.du.u/), except when the second vowel is i or u, in which case they are better understood as diphthongs: /saw.da.di/, /ku.da.du /. The above explanation is part of the theory that underlies the Nhenhém decoder and its parsing procedures.


3. Methodology, discussion and results


In this section, we present the automatic decoder Nhenhém together with the methodology applied to the corpus, since the two are closely related. They are followed by the results and their discussion.
A. The decoder Nhenhém
The word that gives the program its name, nhenhém, comes from the Tupi language, spoken by several indigenous peoples who lived and still live in Brazil, and means the endless repetition of lip movements producing sounds, such as the voice; an analogue of that word could therefore be bla-bla-bla. Nhenhém (/./) is a computer program that decodes Brazil's official writing system into phonological symbols, performs syllable division, and marks prosody. This program was used for converting, editing, grouping, and searching the corpus. The development of the software, in 2008, was inspired by the high level of transparency of the Brazilian Portuguese alphabetic system, although there are some problems, namely the fact that the graphemes e and o each represent two different vowels, /e/, // and /o/, //, respectively. The hypothesis that this system is highly predictable thus guided the construction of rule-based software that automatically converts graphemes into phonemes. Methodologically, the development of the application combines Computational Linguistics, Corpus Linguistics, Statistics, Phonology, and Phonetics. Because the planning of the program combined appropriate methodology and linguistic theory, the software could be built in a programming language that was not specifically designed for processing human language. The symbols Nhenhém uses for the conversions are displayed in Tab. 1. The software reads relatively large amounts of data and produces phonological reports together with statistical reports. On a properly assembled phonological corpus, tests carried out with the application reached no less than 98% accuracy: they reproduce the portion of the Brazilian writing system that is predictable by decoding rules.
In relation to the written system as a whole, accuracy is no less than 95%. It is known that, in order to implement the rules for certain groups, it is important to identify the syllabic unit [13], [14]; however, the first version of Nhenhém [2] reached at least 95% accuracy without recognizing the syllabic unit. This accuracy was measured by testing several texts with the program. Now that the Nhenhém parser is ready to approach this issue properly, many new possibilities for language research are available. In addition, the program reaches at least 99% precision in marking word stress. These results confirm the hypothesis and attest to the high level of predictability of the Brazilian alphabetic system, thanks to its phonological basis. They also corroborate that the Brazilian alphabetic system represents prosody in a logical, accurate, economical and effective manner.

TABLE 1 NHENHM LETTERS, DIGRAPHS AND CORRESPONDING PHONEMES

Graph e e i i o o o o u c c ch g gu h j l l lh lh m n nh qu q r r r r rr s s ss sc s s x x x xc z z

Phon // // // // // // /e/ // // /i/ /i/ // /j/ // // // // /o/ // / / /o/ /w/ /u/ /w/ /u/ // /w/ /s/ /k/ / / / / /g/ / / /w/ /l/ // /l/ /m/ /n/ // /k/ /k/ /r/ |R| // // // /s/ |S| /s/ /s/ /s/ /z/ /kS/ |S| /z/ /s/ /z/ |S|

Example gua (water) quela (to which) lmpada (light bulb) ma (apple) p (foot) contm (it contains) lvedo (barm) tmpora, nfase (temple, emphasis) era (era) elefante (elephant) lvido (livid) lmpido, ndio (clear, Indian) peito (breast) muito (much) ad(i)vento (advent) p (powder) anes (dwarfs) ps (it put past) cmputo, cnscio (calculation, conscious) somente (only) comente (you comment) mo (hand) pato (duck) pau, taquara (wood, bamboo) til (useful) cmplice, anncio (accomplice, ad) cinqenta (fifty) cebola (onion) acudir (to help) achar (to find) gente, agir (people, to act) guerra, guitarra (war, guitar) hoje, ah (today, oh) janela (window) anzol (hook) lenol (sheet), incluso (inclusion) malha (mesh) filhinho (sonny) miar (to meow) ano (year) ninho (nest) quente, caqui (hot, khaki) aqutico (aquatic) cera, prata (wax, silver) amor (love) melro, enredo (blackbird, plot) rosto (face) amarrar (to tie) sapo (frog) mosca, lesma (fly, snail) assar (to bake) fascinante (fascinating) cresa (it grows up) asa (wing) txi (taxi) expor (to expose) exato (exact) exceo (exception) azedo (acid) luz (light)


The program does not cover every aspect of converting written texts into phonological transcription, because there are some exceptions in the Portuguese writing system. For instance, the values of the letter x are not all predictable by rule. It can be decoded as five different phonemes: //, /s/, /z/, /kS/, |S|. For example: graxa, sintaxe, exame, nexo, texto /gra.a/, /s.ta.si/, /e.z.mi/, /n.k.su/, /teS.tu/. Among these five possibilities, two are predictable: the value /z/, when x follows an initial letter e and begins the syllable, and the value // when the letter x ends a syllable and is followed by an initial unvoiced consonant. Another predictable value is //, when the letter x begins the word. There are also some cases of ambiguity, for instance, the value of the letter s after b, e.g., observar (to observe) /o.b.seR.vaR/, obséquio (favor) /o.b.z.ki.u/. We therefore treat that s as representing an archiphoneme: /o.b.Ser.vaR/ and /o.b.S.ki.u/ [16]. Morphology can also give rise to unpredictable situations. For example, the prefix trans-, which means across, causes a pronunciation ambiguity: transamazônica (trans+amazônica) is correctly decoded as /tr.za.ma.zo.ni.ka/, but transiberiana (trans+siberiana) was decoded as */tr.zi.be.ri..na/ instead of /tr.si.be.ri..na/, because there is resyllabification. This problem can only be solved by combining morphological and phonological rules in the program. We discussed this issue in depth in previous works [1], [15], [24], and managed to fix it in 2011 [22]. Furthermore, the [+low] vowels // and // are written e and o, as mentioned, which makes their values hard to predict, since they share their spelling with /e/ and /o/. When they

are stressed and also marked graphically, the conversion is correct. The reduction of pretonic and posttonic vowels is also not fully addressed in the Nhenhém algorithm. In this regard, it is worth pointing out that such reduction is subjective, since most of the time it depends on the speaker's linguistic variety. It is thus an issue for Phonetics, not for Phonology, as long as all distinctive features are preserved, whatever the variety. Moreover, we decided to treat the so-called rising or ascending diphthong as a hiatus [5], [10]; therefore, words ending with it are decoded as stressed on the antepenultimate syllable: ósseo /.si.u/, história /iS.t.ri.a/, náusea /naw.zi.a/, ócio /.si.u/.
In 2010, Nhenhém was ported to another programming language, which allowed us to improve its performance (Fig. 2). We extended the main algorithm, and the system became able to provide the phonological syllabic division. As a consequence, we obtained the orthographic syllabic division, with at least 99% accuracy. It thus became easy to mark the stressed syllable, whereas the 2008 version marked only the stressed vowel. We used this renewed algorithm to build an automatic syllable parser for Brazilian Portuguese [15], and we had to solve the problem of syllabifying words that contain a hyphen, such as beija-flor (hummingbird), pé-de-moleque (a peanut candy), and dever-se-ia (verb to have a duty, third person singular, past future indicative, synthetic passive voice, with tmesis), which we did [16]. In addition, we built an interface between Nhenhém and the software Laça-palavras [17], [18], which is used for linguistic research. Furthermore, we used the Nhenhém

Fig. 2 Main screen of the program Nhenhém 2012, integration with parsing [22]


prosodic-phonological algorithm to build a program for speech therapy [19], consulting the specific literature [20]. This program has been presented [19], [21], commented on [22], and tested [23]; the results were encouraging [23].
The text is converted while the user types or pastes it. Pasted texts must have simple formatting, that is, no capital letters. The stressed vowel is marked on the user's command. Fig. 2 shows the performance of the Nhenhém decoder-parser. In the field Saída 1, the input text appears converted into phonological symbols and parsed; the stressed syllable is indicated by the prosody mark placed before its first symbol. In the field Saída 2, the text appears orthographically parsed. There is only one mistake in the converted text of Fig. 2: the word correto (correct) should be decoded as /ko..tu/. The Nhenhém user can automatically convert anything from a single word to a 20-page text, and edit, save, search and print it. As the conversion accuracy of the system is estimated at no less than

95%, the user can edit the 5% (or less) of the text that was not correctly converted, by converting, replacing and inserting symbols, and adjusting the output to dialects. The program also allows several texts to be stored in a database for specific use in statistical reports.
B. Basic functioning of the automatic syllable parser
As shown in the flowchart (Fig. 3), the phonological syllable parsing of a word depends on the preceding phoneme. If the current phoneme is a consonant or a vowel, and the preceding one is a vowel, a semivowel or an archiphoneme, then the current phoneme occupies a syllable boundary position. Consequently, the syllable division marker, a dot, is inserted before it. For instance, the written word angústias (anguishes) is converted to the phonological form /g'uStiaS/, and then parsed into its phonological syllables as /.'guS.ti.aS/.

Fig. 3 Syllable parsing basic computational process


This word has four vowels, //, /u/, /i/ and /a/, which means that it has four syllables; therefore, the program must insert three syllable markers in it. In this case, the first syllable ends before the consonant /g/, since the phoneme that precedes it, //, is a possible syllable boundary. Then, as there are more phonemes, the search for the second syllable boundary begins. The second syllable ends before the consonant /t/, because /t/ is preceded by the archiphoneme |S|, which is also a possible syllable boundary. Next comes another phoneme, the vowel /i/, which means that there is a third syllable. Then comes the phoneme /a/, which is preceded by another vowel, the /i/ just mentioned, which is a possible syllable boundary, as seen. This leads the program to mark the end of the third syllable before /a/. Since /a/ is a vowel, it belongs to another syllable, the fourth one, which ends with |S|. As there are no more vowels in the word, it has no more syllables, and the parsing procedure is complete. In this process, the program takes into account all the rules presented in section 2.C.
The code snippet shown in Fig. 4 is part of the form belonging to the Nhenhém parser, displayed in Fig. 2. This code is executed every time the input text changes, and it returns both the phonological syllabic division and the orthographic syllabic division. Line 81, which is highlighted, calls the parsing method.
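The basic boundary rule walked through above can be sketched in a few lines of Python. This is a simplified illustration, not the code of Fig. 4: the function `parse_syllables`, the token-list input, the phoneme inventories, and the use of "ã" for the nasal vowel of angústias are assumptions of this example.

```python
# Assumed inventories for this sketch (symbols are illustrative).
VOWELS = {"a", "e", "i", "o", "u", "ɛ", "ɔ", "ã", "ẽ", "ĩ", "õ", "ũ"}
SEMIVOWELS = {"j", "w"}
ARCHIPHONEMES = {"S", "R"}
STRESS = "'"


def can_start_syllable(phoneme):
    """The current phoneme must be a consonant or a vowel."""
    return phoneme not in SEMIVOWELS and phoneme not in ARCHIPHONEMES


def can_precede_boundary(phoneme):
    """The preceding phoneme must be a vowel, semivowel or archiphoneme."""
    return (phoneme in VOWELS or phoneme in SEMIVOWELS
            or phoneme in ARCHIPHONEMES)


def parse_syllables(phonemes):
    """Insert dots at syllable boundaries; move the stress mark to the
    start of the syllable that contains the stressed vowel."""
    syllables = [[]]
    stressed = None      # index of the stressed syllable, if any
    mark_next = False    # a stress mark was seen before this phoneme
    prev = None
    for p in phonemes:
        if p == STRESS:
            mark_next = True
            continue
        if (prev is not None and can_start_syllable(p)
                and can_precede_boundary(prev)):
            syllables.append([])  # boundary before the current phoneme
        syllables[-1].append(p)
        if mark_next:
            stressed = len(syllables) - 1
            mark_next = False
        prev = p
    return ".".join(
        (STRESS if i == stressed else "") + "".join(syl)
        for i, syl in enumerate(syllables)
    )
```

Called with `["ã", "g", "'", "u", "S", "t", "i", "a", "S"]`, the function follows the same four steps as the walk-through. The sketch relies on the transcription having already decided between diphthong and hiatus (semivowels arrive as j or w), and it does not cover the refinements of section 2.C or the orthographic division.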

The parsing method takes four parameters, as shown in Fig. 5. About these parameters: _inOrto is the original entry typed by the user (conhecer, to know); _inFono is the phonological transcription of the orthographic input (keseR); _sFono receives the result of the phonological syllabic parsing (k.e.seR); _sRev receives the result of the orthographic syllabic division (co-nhe-cer). The processes responsible for returning the last two parameters are quite complex, so they will be the subject of future work.
C. Phonologic-syllabic Corpus
In order to test Nhenhém, and also to investigate phonological and syllabic patterns of Brazilian Portuguese in written texts, we assembled a corpus of six articles published in 2007 in a Brazilian dentistry journal. They are technical and scientific texts, revised and up to date, which were not produced for use in linguistic research [26], [27]. The six texts were pre-edited individually in a text editor before being pasted into Nhenhém. Foreign words, words containing graphemes that do not belong to the Portuguese writing system, and measurement units were removed, as were some acronyms. Some of them could

Fig.4 Code snippet of the parser.

Fig.5 The four parameters of the parser.


be replaced by their spelled-out form. The system itself excludes punctuation, hyphens, quotation marks, and some other symbols, so they do not need to be treated beforehand. In order to reduce the chances of conversion errors, care must be taken to ensure that the texts are perfectly readable by Nhenhém. After this preparation, the corpus texts were pasted into the program, converted, printed, checked, edited, rechecked, and saved for research. The exceptions were located and edited so as to obtain correct conversions. Numbers (ages, dates, centuries) were replaced by their spelled-out forms.
D. Statistical Reports: The Patterns
The six texts were loaded to generate statistical reports on phonemic and syllabic distribution. The numbers presented below are those displayed in these reports.
1) The Phonemic Patterns: After conversion, the corpus totaled 70,811 phonemes, distributed into 33,510 syllabic phonemes (vowels), 3,069 non-syllabic phonemes (semivowels), and 34,232 consonant phonemes. These numbers represent 47.32%, 4.33%, and 48.34% of the total, respectively. The distribution of the vowel phonemes is:
Tongue position: 42.95% front, 57.05% back;
Tongue height: 43.75% high, 25.07% mid, 31.18% low;
Airstream way (the route taken by the air flow during vocalization): 87.50% oral, 12.50% nasal;
Lip rounding: 29.25% rounded, 70.75% unrounded.
It is worth remembering that all vowels are voiced. The distribution of consonants is:
Manner of articulation: 51.77% occlusive and 48.23% constrictive, the latter distributed as follows: 61.68% fricative, 29.39% vibrating, 8.93% lateral;
Place of articulation: 64.67% front, 16.50% back, 18.83% labial;
Airstream way: 90.57% oral, 9.43% nasal (oral and nasal);
Phonation: 48.14% unvoiced, 51.86% voiced. The consonantal archiphonemes |S| and |R| are not included in the numbers concerning phonation, because they result from the neutralization of features.
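The class shares quoted for the corpus follow directly from the three counts. As a quick check, the arithmetic can be reproduced with the short Python sketch below; the variable names are illustrative, and only the counts themselves come from the text.

```python
# Phoneme class counts reported for the six-text corpus.
counts = {
    "vowels": 33510,
    "semivowels": 3069,
    "consonants": 34232,
}

total = sum(counts.values())

# Share of each class, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} of {total} = {share}%")
```

The three counts add up to the reported total of 70,811 phonemes, and the rounded shares agree with the percentages given above.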
The statistical report also provides the individual distribution of phonemes, as Tab. 2 displays for the corpus as a whole. To confirm the results, we tested separately one of the six texts of the corpus (10,904 phonemes), whose numbers we present in detail (Fig. 6). The distribution of the main features is very similar, as are the other numbers provided by the report, which points to phonemic patterns. A journalistic text of 8,454 phonemes was also prepared and tested individually with Nhenhém, and the results were similar, with differences of around 1% to 1.5%. Hence, the results, and the numbers showing the phonological patterns of Brazilian Portuguese, seem reliable. The individual distribution for the text whose patterns are shown in Fig. 6 was presented before [24], and can be consulted for comparison with the numbers in Tab. 2.
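A report of individual phoneme frequencies of the kind described here can be approximated with a simple counter. The Python sketch below is only a hypothetical illustration of the idea, not the reporting module of the program: the function `phoneme_distribution` and the assumption that every phoneme is written with a single character are choices made for this example.

```python
from collections import Counter


def phoneme_distribution(transcriptions):
    """Count phoneme symbols in syllabified transcriptions.

    Returns (symbol, count, percentage) rows sorted by decreasing
    frequency. Syllable dots and stress marks are not phonemes and
    are skipped. Assumes one character per phoneme symbol.
    """
    counts = Counter(
        ch for word in transcriptions for ch in word if ch not in ".'"
    )
    total = sum(counts.values())
    return [
        (symbol, n, round(100 * n / total, 2))
        for symbol, n in counts.most_common()
    ]
```

Applied to the three transcriptions `["me.za", "li.vru", "ple.bew"]` quoted earlier in the paper, the function counts 15 phonemes and ranks /e/ first with 3 occurrences.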

TABLE 2 CORPUS PHONEME INDIVIDUAL DISTRIBUTION

Ph      Q       %          Ph      Q       %
/a/     8851    12.50%     /n/     1304    1.84%
/i/     7587    10.74%     /z/     1152    1.63%
/u/     4618    6.56%      / /     1084    1.53%
/t/     4464    6.30%      /v/     934     1.32%
/d/     4124    5.82%      /f/     887     1.25%
/e/     3861    5.45%      //      774     1.09%
/S/     3538    5.00%      //      677     0.96%
/s/     3177    4.49%      /b/     568     0.80%
/r/     2961    4.18%      / /     560     0.79%
/k/     2754    3.89%      / /     551     0.78%
/o/     2571    3.63%      //      516     0.73%
/p/     2208    3.12%      / /     406     0.57%
/ /     1966    2.78%      /g/     375     0.53%
/w/     1964    2.74%      //      212     0.30%
/m/     1849    2.61%      //      90      0.13%
/l/     1419    2.00%      //      75      0.09%
/R/     1341    1.89%      //      55      0.07%
//      1317    1.86%      / /     21      0.03%
Total:                     36      70811   100%

We looked for another program, or even a study, that approaches this issue in a similar way, that is, one that classifies the segments according to their features and reports such statistics from a corpus, but we did not find any. So, for the time being, we cannot make comparisons to confirm the reliability of the numbers presented here. We comment on some results below, but much more can be said about them. The back (posterior) vowels are more frequent than the front (non-posterior) vowels by around 14 percentage points (57.05% against 42.95%). The most frequent posterior vowels are /a/ and /u/; among the front vowels, the most frequent is /i/, which trails /a/ by less than 2 percentage points (Tab. 2). Thus, the vowel that occurs most in Portuguese is /a/, followed by /i/. The semivowel // occurs only in the word muito (many, much) /mu.tu/ and derived forms. This is the only symbol used by Nhenhém that may not be read correctly by all computers. We have been studying a way of changing it, so as to overcome the small disturbance it may cause to the decoder-parser. Tests have shown that the other symbols are read as normal text by any computer. The // is counted together with /i/, since the former occurs when a word contains a sequence of two consonants which ordinarily do not form a coda (decline) and which belong to different syllables. In this case, the epenthetic // occurs while the sequence is pronounced. This inserted phoneme thus works as the core of a phonological syllable: opção (option), cacto (cactus) /o.p.sw/, /ka.k.tu/.
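Tab. 2 can also be read cumulatively. The Python sketch below uses only the counts (Q) of the 18 most frequent phonemes, in the order the table lists them, to show how much of the corpus the top-ranked phonemes cover. The helper name `coverage` is illustrative and is not part of Nhenhém.

```python
# Counts (Q) of the 18 most frequent phonemes, in the order given in Tab. 2.
TOP_COUNTS = [8851, 7587, 4618, 4464, 4124, 3861, 3538, 3177, 2961,
              2754, 2571, 2208, 1966, 1964, 1849, 1419, 1341, 1317]
CORPUS_TOTAL = 70811  # total number of phonemes in the corpus


def coverage(counts, total, k):
    """Share (percent) of the corpus covered by the k most frequent items."""
    return 100 * sum(counts[:k]) / total


top2 = coverage(TOP_COUNTS, CORPUS_TOTAL, 2)
top12 = coverage(TOP_COUNTS, CORPUS_TOTAL, 12)
top18 = coverage(TOP_COUNTS, CORPUS_TOTAL, 18)

print(f"top 2:  {top2:.2f}%")
print(f"top 12: {top12:.2f}%")
print(f"top 18: {top18:.2f}%")
```

With these counts, the twelve most frequent phonemes cover a little under 72% of the corpus.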


Fig. 6 Nhenhém statistical report: general distribution [22], [24].

Looking at the oral and nasal features, for both consonants and vowels, we can see that Portuguese is predominantly oral: around 89% of it is oral and only around 11% is nasal. The most frequent nasal phonemes are the vowel // (13th place in occurrence) and the consonant /m/ (15th place in occurrence). Hence, the 12 most frequent phonemes of Portuguese are oral. As for the consonant phonemes, there is a balance between constrictives and occlusives, although occlusives occur around 3% more often than constrictives. The two most frequent consonants of Portuguese are the occlusives /t/ and /d/. Another noteworthy fact is that the fricative //, the nasal occlusive //, and the posterior lateral // are actually rare in this language. Nevertheless, most of them appear in all six texts that form the corpus, since they occur in some very common words, like chegar, deixar (to arrive, to leave) /e.gaR/, /de.aR/, conhecer, tamanho (to know, size) /k.e.seR/, /ta.m.u/, trabalho (work) /tra.ba.u/. The only sounds that do not occur are / / in one text and // in another. From the results, we find that the Brazilian Portuguese phonemic distribution is uniform, since the amounts of vowels and consonants tend to be around 50% each. The semivowels reveal the number of diphthongs (the real ones, that is, falling or decreasing diphthongs), since the semivowels only occur in this case. Diphthongs of the kind vowel+/w/ are more frequent than those of the kind vowel+/j/, since the first kind appears, at the very least, 64% more often than the second. Furthermore, it is reasonable to hypothesize that CV (consonant+vowel) is the most common syllable pattern of Brazilian Portuguese, which leads us to address the Brazilian Portuguese syllable.

2) The Syllable Patterns: The phonological-syllabic report, grouping the six texts of the corpus by syllabic frequency, reveals 628 syllable types, which suggests that there are not many more than that in Brazilian Portuguese, and shows that the corpus as a whole is formed by 33,960 syllables. Tab. 3 shows the 30 most common syllable types. According to Tab. 3, the most frequent syllable of Brazilian Portuguese is formed by the vowel /a/, which, besides occupying any position in words and combining with any consonant, group of consonants, or semivowel, also forms a syllable by itself. Moreover, it is the feminine singular determiner, a preposition, an accusative pronoun, and a demonstrative. It represents around 5% of the syllables of the corpus, and Tab. 4 confirms this condition. Since 22 of the 30 syllables presented are open complex, that is, CV syllables, we can conclude that CV is the most frequent syllable pattern of Brazilian Portuguese. Two syllables contain the nasal sound //, which confirms this sound as the most frequent among the nasal vowels, as the phonemic report indicates.
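The CV label used above can be made concrete with a small classifier. The Python sketch below maps each symbol of a syllable to C (consonant), V (vowel), or G (semivowel) and tallies pattern frequencies for a few syllables from Tab. 3 whose symbols are plain ASCII. The symbol inventories and the function name `cv_pattern` are assumptions made for illustration; they do not reproduce Nhenhém's parser, and the tally covers only the sampled rows.

```python
from collections import Counter

# Assumed inventories, restricted to plain-ASCII symbols for illustration.
VOWELS = set("aeiou")
SEMIVOWELS = set("jw")


def cv_pattern(syllable):
    """Return the C/V/G skeleton of a syllable written one symbol per phoneme."""
    skeleton = []
    for symbol in syllable:
        if symbol in VOWELS:
            skeleton.append("V")
        elif symbol in SEMIVOWELS:
            skeleton.append("G")
        else:
            # Everything else, including the archiphonemes S and R,
            # is treated as a consonant.
            skeleton.append("C")
    return "".join(skeleton)


# A sample of (syllable, occurrences) pairs copied from Tab. 3.
SAMPLE = [("a", 1710), ("di", 1286), ("si", 1117), ("ti", 919),
          ("u", 813), ("du", 722), ("ta", 694), ("da", 642),
          ("eS", 305), ("duS", 278)]

pattern_totals = Counter()
for syllable, occurrences in SAMPLE:
    pattern_totals[cv_pattern(syllable)] += occurrences

for pattern, total in pattern_totals.most_common():
    print(pattern, total)
```

Even in this small sample the CV skeleton accumulates the most occurrences, in line with the conclusion drawn from the full table.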

TABLE 3 THE 30 MOST FREQUENT SYLLABLES OF BRAZILIAN PORTUGUESE

Syllable types    Occurrences    %
/a/               1710           5.04%
/di/              1286           3.79%
/si/              1117           3.29%
/ti/              919            2.71%
/u/               813            2.39%
/du/              722            2.13%
/ta/              694            2.04%
/da/              642            1.89%
/i/               631            1.86%
/tu/              612            1.80%
/ra/              574            1.69%
/k/               554            1.63%
/m/               537            1.58%
/pa/              490            1.44%
/ka/              467            1.38%
/sw/              455            1.34%
/ri/              432            1.27%
/na/              428            1.26%
/li/              397            1.17%
/se/              386            1.14%
/e/               384            1.13%
/o/               362            1.07%
/e/               346            1.02%
/d/               328            0.97%
/eS/              305            0.90%
/te/              291            0.86%
/zi/              288            0.85%
/ma/              286            0.84%
/ki/              280            0.82%
/duS/             278            0.82%
...               ...            ...
/eR/              1              0.003%
Total: 628        33960          100%

TABLE 4 THE 10 MOST FREQUENT SYLLABLES OF EACH TEXT OF THE CORPUS

Text 1                        Text 4
Syll.    Q      %             Syll.    Q      %
/a/      303    4.74%         /di/     167    4.91%
/di/     243    3.80%         /a/      165    4.85%
/si/     218    3.41%         /si/     110    3.23%
/u/      212    3.31%         /ka/     94     2.76%
/ta/     180    2.81%         /ti/     84     2.47%
/tu/     173    2.70%         /da/     79     2.32%
/du/     168    2.63%         /u/      77     2.26%
/ti/     167    2.61%         /ri/     63     1.85%
/m/      154    2.41%         /i/      60     1.76%
/i/      118    1.84%         /ta/     53     1.56%

Text 2                        Text 5
Syll.    Q      %             Syll.    Q      %
/a/      233    4.38%         /a/      269    5.65%
/di/     179    3.37%         /ti/     147    3.09%
/si/     168    3.16%         /di/     145    3.05%
/ti/     156    2.94%         /si/     136    2.86%
/du/     140    2.63%         /i/      111    2.33%
/ta/     128    2.41%         /u/      105    2.21%
/u/      127    2.39%         /du/     104    2.19%
/ra/     104    1.96%         /tu/     97     2.04%
/k/      102    1.92%         /e/      96     2.02%
/i/      95     1.79%         /na/     94     1.98%

Text 3                        Text 6
Syll.    Q      %             Syll.    Q      %
/a/      316    4.52%         /a/      424    5.98%
/di/     248    3.55%         /di/     304    4.29%
/si/     225    3.22%         /si/     260    3.67%
/ti/     180    2.57%         /ti/     185    2.61%
/se/     164    2.34%         /da/     150    2.12%
/du/     155    2.22%         /u/      144    2.03%
/da/     153    2.19%         /i/      128    1.81%
/ta/     148    2.12%         /ta/     114    1.61%
/u/      148    2.12%         /du/     111    1.57%
/ra/     142    2.03%         /tu/     108    1.52%

Concerning the least frequent syllables, we found that 97 types occur only once (e.g. /kR/, /pR/, /liR/, /e/, /vre/, /loR/, /ka/, /braS/, /kwaR/); 58 types occur twice (e.g. /uS/, /vw/, /moR/, /p/, /baS/, /row/, /tuR/, /glo/, /baR/, /tr/); and 31 types occur three times (e.g. /neS/, /saR/, /niR/, /vriS/, /bru/, /bR/, /dr/, /deR/, /k/, /ke/). See Tab. 1 for some examples. There is no closed syllable among the 10 most frequent ones in the six texts of the corpus (Tab. 4). No complex slope (consonant clusters and velar+/w/) occurs among the 30 most frequent syllables (Tab. 3). Although the parser was created in 2010, this is the first time it has been used for research and its results have been shown. The syllabic report confirms the phoneme distribution report and opens a range of possibilities for language research. However, a deeper analysis of the numbers presented is a subject for further studies.
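As a rough consistency check between Tab. 3 and Tab. 4, the counts reported for /a/ in each text can be summed, and the size of each text can be estimated from the count and percentage reported for /a/. The Python sketch below does this, assuming the six blocks of Tab. 4 are read as Texts 1 to 6 in order. The per-text totals it prints are back-calculated estimates affected by rounding (an assumption of this illustration), not figures reported by the authors.

```python
# (count, percent) reported in Tab. 4 for the syllable /a/ in Texts 1 to 6.
A_PER_TEXT = [(303, 4.74), (233, 4.38), (316, 4.52),
              (165, 4.85), (269, 5.65), (424, 5.98)]

CORPUS_SYLLABLES = 33960  # total number of syllables reported in Tab. 3
CORPUS_A_COUNT = 1710     # occurrences of /a/ in the whole corpus (Tab. 3)
CORPUS_A_SHARE = 5.04     # share of /a/ in the whole corpus (Tab. 3)


def estimated_size(count, percent):
    """Back-calculate a text's syllable total from one count/percent pair."""
    return count / (percent / 100)


total_a = sum(count for count, _ in A_PER_TEXT)
sizes = [estimated_size(count, pct) for count, pct in A_PER_TEXT]
estimated_total = sum(sizes)

# Largest gap, in percentage points, between a text's /a/ share
# and the share of /a/ in the whole corpus.
max_gap = max(abs(pct - CORPUS_A_SHARE) for _, pct in A_PER_TEXT)

print(f"/a/ summed over texts: {total_a} (Tab. 3: {CORPUS_A_COUNT})")
for number, size in enumerate(sizes, start=1):
    print(f"Text {number}: about {size:.0f} syllables")
print(f"estimated corpus total: about {estimated_total:.0f} "
      f"(reported: {CORPUS_SYLLABLES})")
print(f"largest /a/ share gap: {max_gap:.2f} percentage points")
```

The per-text counts of /a/ add up to the corpus figure in Tab. 3, and the estimated total stays within rounding distance of the reported number of syllables.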

The Nhenhém decoder made it possible to determine the distribution of phonemic patterns in Brazilian Portuguese, and the parser made it possible to understand how such phonemes combine in a linguistic unit larger than the phoneme: the syllable.

E. The Spelling Agreement of 1991 (2009)

Some changes will occur in Brazilian Portuguese spelling, due to the spelling agreement already mentioned, according to which at least seven of the countries where Portuguese is spoken must use the same spelling from 2013 on. Although most users, even the press, have adopted the new rules, the Brazilian government postponed the requirement to use the new orthographic rules until 2016. The most important change for Brazilian Portuguese orthography is the exclusion of the dieresis (trema). Consequently, the value of the digraphs qu- and gu- becomes unpredictable. Thus, agüentar (to stand) and eqüino (horse), correctly decoded as /a.gw.taR/ and /e.kwi.nu/, will be spelled aguentar and equino, generating the transcriptions */a.g.taR/ and */e.ki.nu/. In this case, the syllable division remains correct, but the syllabic structure affected by the new rules, of course, does not. In Brazil, the use of the dieresis is still very common, even by government institutions. For this reason, Nhenhém will preserve this resource in its algorithm. The exclusion of the dieresis means that the alphabetic system loses transparency, that is, it loses one of the rules that make it predictable; therefore, reading (decoding) is impaired. Other changes interfere less with the automatic transcription (see [22] for more details), but none of them disturbs the prosody system.
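The loss of predictability can be illustrated with a toy rule for the velar digraphs. The Python sketch below is not Nhenhém's algorithm: the function name `decode_velar_digraphs` and the simplified rule set (only gu, qu, gü and qü before e or i, with all other letters copied unchanged) are assumptions made for illustration.

```python
import unicodedata

# Toy rules: orthographic sequence -> phonemic sequence, tried in this order.
# With the dieresis (\u00fc) the u is pronounced; without it, gu/qu before
# e or i is read as a plain velar consonant.
RULES = [("g\u00fc", "gw"), ("q\u00fc", "kw"), ("gu", "g"), ("qu", "k")]
FRONT_VOWELS = set("ei\u00e9\u00ed\u00ea")  # e, i and their accented forms


def decode_velar_digraphs(word):
    """Rewrite gu/qu/gü/qü before a front vowel; copy everything else as is."""
    word = unicodedata.normalize("NFC", word.lower())
    out = []
    i = 0
    while i < len(word):
        for spelling, phonemes in RULES:
            end = i + len(spelling)
            following = word[end:end + 1]
            if word.startswith(spelling, i) and following in FRONT_VOWELS:
                out.append(phonemes)
                i = end
                break
        else:
            # No rule applied at this position: copy the letter unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


for spelling in ["agüentar", "aguentar", "eqüino", "equino"]:
    print(spelling, "->", decode_velar_digraphs(spelling))
```

With the dieresis, the two spellings of each word decode differently. Once both are written aguentar and equino, a rule of this kind can no longer recover the pronounced /w/ from the spelling alone and would need extra information, such as an exception list.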

4. Conclusion and Outlooks

The experience of building, testing and using Nhenhém has shown how difficult the linguistic reading and conversion of electronic text is. The phonemic level is the easiest to systematize, and only a few questions regarding it remain unsolved; the difficulty is greater at the syllable level, but it can also be regarded as overcome; the morphology level comes next, where a little progress has already been made; and then comes syntax, which is more intricate. The complexity of each level may be attenuated by the systematization of the previous levels, because each one takes advantage of the systematization of the others. So, converters like Nhenhém are a step towards future work on levels that transcend the phoneme, as we did for the syllable. Some decisions taken in building the system are objectionable to some and noteworthy to others, as are some of the theories chosen. However, this was not optional. The choices came from the needs imposed by the programming and, within that, from the objectivity and intelligibility of existing theories, and from the beliefs and intuition of teachers, students and other language users. The efficiency of Nhenhém confirms the usefulness of the theories adopted. Now that automatic syllable parsing has been achieved, the project goes on. The systematization of the Brazilian Portuguese syllable made it possible to start addressing morphology, as we said. Still, there is a lot to be studied about the syllable, now that the reports are available. In this sense, one question to be approached in depth relates to prosody, a task that has already started [28]. Nevertheless, there are still some adjustments and additions to be made to the parser, in order to optimize editing by the user. Some of the next steps are building a voice synthesizer from Nhenhém and improving Nhenhém Fonoaud, which is the program for speech therapy; this program already benefits from automatic syllable division.

The program supports the analysis of processes that occur in the child's phonological system, through automatic phonological transcription carried out simultaneously with the recording of samples of the child's speech. Thus, the data rely on a phonemic representation of speech, automatically produced by the Nhenhém phonological-prosodic algorithm. NhFonoaud is designed for dealing with phonological tests, using words deliberately grouped to analyze specific aspects of speech and phenomena involved in its development. Thus, it is an application for assisting speech therapy, and also language acquisition research. We are now making this program able to create graphics from the records and automatically identify the phonological processes involved in the child's speech. Back to the algorithm, we are still working on rules to reduce the 5% failure rate (probably much less now) in the conversion. Once it was made clear that the conversion tool successfully exploits the close correspondence between orthographic representation and pronunciation in Brazilian Portuguese, at the phoneme and syllable levels, it proved to be useful in a wide range of applications.

References
[1] S. Silva Neto. História da língua portuguesa. Fifth Edition. Rio de Janeiro: Presença, 1988.
[2] V. Vasilévski. Construção de um programa computacional para suporte à pesquisa em fonologia do português do Brasil, (2008). PhD Thesis, Federal University of Santa Catarina, Florianópolis, Brazil.
[3] L. Scliar-Cabral. Princípios do sistema alfabético do português do Brasil. São Paulo: Contexto, 2003a.
[4] B. Malmberg. A fonética: teoria e aplicações, (1993). Caderno de Estudos Lingüísticos, no.25, pp.7-24.
[5] J. M. Câmara Jr. Estrutura da língua portuguesa. 16th Edition. Petrópolis: Vozes, 1986.
[6] J. M. Câmara Jr. Problemas de Lingüística descritiva. 16th Edition. Petrópolis: Vozes, 1997.
[7] J. M. Câmara Jr. Para o estudo da fonêmica portuguesa. Second Edition. Padrão: Rio de Janeiro, 1977.
[8] M. Said Ali. Gramática secundária e Gramática histórica da língua portuguesa. Third Edition. Brasília: Editora da UnB, 1964.
[9] E. Bechara. Moderna gramática portuguesa. 19th Edition. São Paulo: Cia. Editora Nacional, 1973.
[10] L. Bisol. O ditongo da perspectiva da fonologia atual, (1989). Revista Delta, vol.5, no.2, pp.185-224.
[11] L. C. Cagliari. Análise fonológica: introdução à teoria e à prática. Campinas: Mercado das Letras, 2002.
[12] International Phonetic Alphabet (IPA). 2013. http://www.langsci.ucl.ac.uk/ipa/ipachart.html
[13] J. J. Almeida, A. Simões. Text to speech - A rewriting system approach, (2001). Procesamiento del Lenguaje Natural, vol. 27, pp. 247-255.
[14] S. Candeias, F. Perdigão. Conversor de grafemas para fones baseado em regras para português, (2008). L. Costa, D. Santos, N. Cardoso (Eds.). Perspectivas sobre a Linguateca/Actas do encontro Linguateca: 10 anos, n.14, pp.99-104.
[15] V. Vasilévski. Divisão silábica automática de texto escrito baseada em princípios fonológicos, (2010). Anais do III Encontro de Pós-graduação em Letras da UFS (ENPOLE), São Cristóvão, Sergipe, Brazil.
[16] V. Vasilévski. O hífen na separação silábica automática, (2011). Revista do Simpósio de Estudos Lingüísticos e Literários SELL, vol.1, no.3, pp. 657-676.
[17] V. Vasilévski, M. J. Araújo. Laça-palavras: an electronic system for the Description of Brazilian Portuguese. Florianópolis: LAPLE-UFSC, 2010-2013. http://www./sites.google.com/site/sisnhenhem/
[18] L. Scliar-Cabral, V. Vasilévski. Descrição do português com auxílio de programa computacional de interface, (2011). Anais da II Jornada de Descrição do Português (JDP), Cuiabá, Brazil.
[19] H. F. Blasi, V. Vasilévski. Programa piloto para transcrição fonética automática na clínica fonoaudiológica, (2011). Documentos para el XVI Congresso Internacional de la ALFAL, Universidad de Alcalá, Alcalá de Henares/Madrid.
[20] L. Scliar-Cabral. Guia prático de alfabetização. São Paulo: Contexto, 2003b.
[21] T. M. Garcez, H. F. Blasi, V. Vasilévski. Aplicação do programa piloto para transcrição fonética automática na clínica fonoaudiológica, (2011). Anais do 19. Congresso Brasileiro e 8. Congresso Internacional de Fonoaudiologia. São Paulo, Brazil. http://www.sbfa.org.br/portal/suplementorsbfa
[22] V. Vasilévski. Descodificación automática de la lengua escrita de Brasil basada en reglas fonológicas. Saarbrücken: Editorial Académica Española, 2012.
[23] V. Vasilévski, M. J. Araújo, H. F. Blasi. A Brazilian Portuguese Phonological-prosodic Algorithm Applied to Deviant Language Acquisition: A Case Study, (2013). Paper to be presented.
[24] V. Vasilévski. Phonologic Patterns of Brazilian Portuguese: a grapheme to phoneme converter based study, (2012b). Proceedings of the EACL, Workshop on Computational Models of Language Acquisition and Loss. University of Avignon, France.
[25] J. M. Câmara Jr. Línguas européias de ultramar: o português do Brasil, (1972). In: C. E. F. Uchôa (org.). Dispersos de J. Mattoso Câmara Jr. Rio de Janeiro: Fundação Getúlio Vargas, pp.71-93.
[26] J. Sinclair. Corpus, concordance, collocation. Oxford University Press: Oxford, 1991.
[27] G. Leech. Corpora and theories of linguistic performance, (1992). J. Svartvik (Org.). Directions in corpus linguistics, Berlin: Mouton de Gruyter.
[28] V. Vasilévski, M. J. Araújo. Um Algoritmo Prosódico para Português do Brasil, (2013). Paper to be presented.

Vera Vasilévski
Post-PhD student in Language Acquisition (Phonology and Morphology) at the Federal University of Santa Catarina (UFSC/CAPES), Brazil. Professor at the State University of Ponta Grossa (UEPG), Paraná, Brazil. Research Group Emergent Linguistic Productivity (CNPq).

Leonor Scliar-Cabral
Prof. Emeritus at UFSC, Dr. in Linguistics/Psycholinguistics at the University of São Paulo, Brazil. Honorary President of the International Society of Applied Psycholinguistics (ISAPL). Responsible for the Research Group Emergent Linguistic Productivity CNPq/CAPES.

Márcio José Araújo
System Developer Analyst at Topdata Automation Systems, Electrical Engineering graduate student at the Federal Technological University of Paraná, Brazil. Natural Language Processing programs developer. Research Group Emergent Linguistic Productivity.
