
Proceedings

FONETIK 2008
The XXIst Swedish Phonetics Conference
June 11–13, 2008


Department of Linguistics
University of Gothenburg


Proceedings FONETIK 2008
The XXIst Swedish Phonetics Conference,
held at the University of Gothenburg, June 11–13, 2008
Edited by Anders Eriksson and Jonas Lindh
Department of Linguistics
University of Gothenburg
Box 200, SE 405 30 Gothenburg
ISBN 978-91-977196-0-5
© The Authors and the Department of Linguistics
Printed by Reprocentralen, Humanisten, University of Gothenburg.


Preface
This volume contains the contributions to FONETIK 2008, the Twenty-first Swedish
Phonetics Conference, organized by the Phonetics group at the University of
Gothenburg on June 11–13, 2008. The papers appear in alphabetical order of the
surname of the first author.
Only a limited number of copies of this publication have been printed for
distribution among the authors and those attending the conference. For access to
electronic versions of the contributions, please look under:
http://www.ling.gu.se/konferenser/fonetik2008/papers/Proc_fonetik_2008.pdf
We would like to thank all contributors to the Proceedings. We are also indebted to
Fonetikstiftelsen for financial support.
Göteborg, June 2008


Anders Eriksson    Åsa Abelin    Jonas Lindh


Previous Swedish Phonetics Conferences (from 1986)


I 1986 Uppsala University
II 1988 Lund University
III 1989 KTH Stockholm
IV 1990 Umeå University (Lövånger)
V 1991 Stockholm University
VI 1992 Chalmers and Göteborg University
VII 1993 Uppsala University
VIII 1994 Lund University (Höör)
1995 (XIIIth ICPhS in Stockholm)
IX 1996 KTH Stockholm (Nässlingen)
X 1997 Umeå University
XI 1998 Stockholm University
XII 1999 Göteborg University
XIII 2000 Skövde University
XIV 2001 Lund University
XV 2002 KTH Stockholm
XVI 2003 Umeå University (Lövånger)
XVII 2004 Stockholm University
XVIII 2005 Göteborg University
XIX 2006 Lund University
XX 2007 KTH Stockholm

Contents
Speech production
From articulatory to acoustic parameters non-stop 1
Elisabet Eir Cortes and Björn Lindblom

(Re)use of place features in voiced stop systems: Role of phonetic constraints 5
Björn Lindblom, Randy Diehl, Sang-Hoon Park and Giampiero Salvi

On the Non-uniqueness of Acoustic-to-Articulatory Mapping 9
Daniel Neiberg and G. Ananthakrishnan

Pronunciation in Swedish encyclopedias: phonetic transcriptions and sound recordings 13
Michaël Stenberg
Prosody I
EXPROS: Tools for exploratory experimentation with prosody 17
Joakim Gustafson and Jens Edlund

Presenting in English or Swedish: Differences in speaking rate 21
Rebecca Hincks

Preaspiration and perceived vowel duration in Norwegian 25
Jacques Koreman, William J. Barry and Marte Kristine Lindseth

The fundamental frequency variation spectrum 29
Kornel Laskowski, Mattias Heldner and Jens Edlund
Speech technology
Speech technology in the European project MonAMI 33
Jonas Beskow, Jens Edlund, Björn Granström, Joakim Gustafson, Oskar Jonsson and
Gabriel Skantze

Knowledge-rich model transformations for speaker normalization in speech recognition 37
Mats Blomberg and Daniel Elenius

Development of a southern Swedish clustergen voice for speech synthesis 41
Johan Frid
Speech perception
The perception of English consonants by Norwegian listeners: A preliminary report 45
Wim A. van Dommelen

Reaction times in the perception of quantity in Icelandic 49
Jörgen L. Pind

Emotion discrimination with increasing time windows in spoken Finnish 53
Eero Väyrynen, Juhani Toivanen and Tapio Seppänen

Looking at tongues – can it help in speech perception? 57
Preben Wik and Olov Engwall
Variation and change I
Human Recognition of Swedish Dialects 61
Jonas Beskow, Gösta Bruce, Laura Enflo, Björn Granström and Susanne Schötz

F0 in contrastively accented words in three Finnish dialect areas 65
Riikka Ylitalo
Speech acquisition, speech development and second language
learning
Improving speaker skill in a resynthesis experiment 69
Eva Strangert and Joakim Gustafson

Second-language speaker interpretations of intonational semantics in English 73
Juhani Toivanen

Measures of continuous voicing related to voice quality in five-year-old children 77
Mechtild Tronnier and Anita McAllister
Prosody II
On final rises and fall-rises in German and Swedish 81
Gilbert Ambrazaitis

SWING: A tool for modelling intonational varieties of Swedish 85
Jonas Beskow, Gösta Bruce, Laura Enflo, Björn Granström and Susanne Schötz

Recognizing phrase and utterance as prosodic units in non-tonal dialects of Kammu 89
Anastasia Karlsson, David House and Damrong Tayanin


Phonological association of tone. Phonetic implications in West Swedish and East Norwegian 93
Tomas Riad and My Segerup
Variation and change II
Vowels in rural southwest Tyrone 97
Una Cunningham

The beginnings of a database for historical sound change 101
Olle Engstrand, Pétur Helgason and Mikael Parkvall

Author index 105



From articulatory to acoustic parameters non-stop
Elisabet Eir Cortes and Björn Lindblom
Department of Linguistics, Stockholm University

Abstract
This paper reports an attempt to map the time
variations of selected articulatory parameters
(from X-ray profiles) directly on the F1, F2 and
F3 formant tracks using multiple regression
analysis (MRA). The results indicate that MRA
can indeed be useful for predicting formant
frequencies. Since the results reported here are
limited to preliminary observations of F1 only,
further studies including F2 and F3 are needed
to evaluate the method more definitively.
Introduction
The traditional method of calculating the for-
mant pattern associated with a set of articulato-
ry measurements goes by way of the cross-
sectional area function (Fant 1960). Heinz and
Stevens (1964) proposed a procedure which de-
rives vocal tract cross-sectional areas from
cross-distances. At each point in the vocal tract,
the following formula (Equation 1) relates the
mid-sagittal distance d to the area A of the
cross-section at that particular point:
A = αd^β    (Eq. 1)
where α and β are constants dependent on
speaker and position along the vocal tract. The
7 steps of this method can be summarized as in
Fig.1.















[Figure 1 panels: A) x-ray profile; B) vocal tract midline; C) mid-sagittal distances; D) distance-to-area rules (A = αd^β); E) area functions derived; F) area functions run through software; G) software returns formant frequencies.]
Figure 1. The traditional method of deriving for-
mants from articulatory information.
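As a numerical illustration of panels D–E, the sketch below (Python) applies Equation (1) to a hypothetical mid-sagittal distance profile. The α and β values are invented placeholders; as discussed below, they in fact vary with speaker and with position along the vocal tract.

import numpy as np

def areas_from_distances(d_mm, alpha, beta):
    # Heinz & Stevens rule (Eq. 1): cross-sectional area A = alpha * d**beta at each point
    d_mm = np.asarray(d_mm, dtype=float)
    return alpha * d_mm ** beta

# hypothetical mid-sagittal distance profile, glottis to lips (mm)
d_profile = np.array([4.0, 6.0, 9.0, 12.0, 8.0, 5.0, 3.0])
# hypothetical (alpha, beta) coefficients, one pair per section of the tract
alpha = np.array([1.5, 1.5, 1.6, 1.6, 1.8, 1.8, 2.0])
beta = np.array([1.3, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5])

area_function = areas_from_distances(d_profile, alpha, beta)  # step E of Fig. 1
print(area_function)  # steps F-G would pass this area function to a formant-calculation program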

The performance of this method was recent-
ly evaluated by Ericsdotter (2005). With the aid
of MR images from two subjects articulating
Swedish vowels, she was able to test Equation
(1) at a number of locations both in the pharynx
and in the oral cavity. By and large the method
was found to be descriptively adequate. She
found that the values of α and β varied with po-
sition in the vocal tract and between subjects.
By taking vowel identity into account the area
predictions were somewhat improved. Yet,
acoustically, this improvement seemed to be of
"only slight acoustic importance" (p. 155). It is
interesting to note that in many cases Ericsdot-
ter in fact found that a roughening of the me-
thod, reducing the number of cross-sections and
thus the number of equations, did not severely
worsen the acoustic outcome. We will have
more to say about that observation anon.
Work on articulatory modeling (e.g. the
APEX model; Stark et al 1996) indicated that
variations in articulatory parameters such as the
jaw, the front-back dimension of the tongue and
the elevation of the tongue blade have an ap-
proximate but fairly simple relationship to for-
mant changes (Lindblom & Sundberg 1971).
Examples of such rules of thumb are:

F1 is controlled by the jaw
F2 is controlled by the front-back movements of
the tongue body
F3 is controlled by the tongue blade (elevation
opens a cavity under the tongue blade, which re-
turns a low F3 in, e.g., retroflexed articulation).

The following question arises: Would it be
possible to express such rules of thumb more
quantitatively? Suppose we have access to arti-
culatory data, could we simplify the procedure
of Fig.1 by eliminating altogether the interme-
diate stages B-F, thus moving straight from A
to G in 2 steps (see Fig. 2 below)?
How well would formants be predicted by
such a drastic shortcut method? Will the accu-
racy of the predictions be satisfactory for vari-
ous applications, say educational (teaching
acoustic phonetics) and technical (articulatory
synthesis), to name some?
Figure 2. The method under investigation: going directly from A) the x-ray profile to G) formant frequencies.
On the down side, an immediate objection
springs to mind: the well-documented non-
linearity in the relationship between articulation
and acoustics (Fant 1960; Stevens 1998). The
non-monotone nature of this relationship sug-
gests that a direct connection will be hard to
find. However, there are also certain positive
considerations to make.
First, our research group is fortunate to have
access to a unique collection of X-ray record-
ings with synchronously recorded sound (Bran-
derud et al 1998). By extracting parallel articu-
latory and acoustic measurements, the opportu-
nity opens for investigating the question raised
earlier, and to give an empirical answer to how
well formants can be predicted non-stop from
articulatory data.
Second, it might be useful to explore certain
statistical techniques, for instance multiple re-
gression analysis (MRA) which is a method
that can be used to numerically relate a depen-
dent variable (a formant frequency, for in-
stance) to several independent variables (articu-
latory parameter values, for instance). Impor-
tantly this method is not limited to linear re-
gressions since independent variables could be
defined as various mathematical functions of
the articulatory parameters.
The aim of the current study is thus as fol-
lows: To select articulatory data from an X-ray
film of a male speaker focusing on the frames
associated with vowels and vowel transitions,
make moment-by-moment comparisons with
the corresponding formant measurements, and
investigate the numerical relationships (linear
or not) that obtain between them. For the pur-
pose of the present report, the investigatory fo-
cus will be on the first formant.
Procedure
Our data come from a 20-second X-ray film of
a Swedish male speaker (for speech materials,
see Table 1). The images portraying a midsagit-
tal articulatory profile were sampled at 50
frames/sec (see Branderud et al. 1998 for de-
tails on the method).
Table 1. The speech material. Parentheses indicate
sounds not included in the analysis.
Vowel b-context d-con. g-con.
/i/ /ibi(pi:l)/ /di/ /i/
// /b(p:k)/ /d/ //
/a/ /da(s)/ /a(st)/
// /b(p:r)/ /d(l)/ /(l)/
/o/ /obo(po:l)/ /do(lk)/ /o(lv)/
/u/ /ubu(pu:l)/ /du(s)/ /u(s)/

The final part of the b-words, as well as the fi-
nal consonant in the other words, was not in-
cluded in the analysis; nonetheless, its presence
announces itself beforehand in the preceding
vowel. We will return to this in the Results.

Tracings of all acoustically relevant struc-
tures were made using the OSIRIS software
package (University of Geneva). Contours de-
fined in OSIRIS were converted into tables with
x- and y-coordinates using PAPEX (Branderud
2002), calibrated in mm and corrected for head
movements (palate contour fit). For the tongue,
the contours were further processed by redefin-
ing them in a jaw-based coordinate system and
by resampling them at 25 equidistant flesh-
points, which were fed into a Principal Com-
ponent Analysis (PCA), providing the numeri-
cal specification of the tongue shapes (see
Lindblom 2003 for details on the method).

Articulatory parameters
In the classical work on vocal tract modeling
(Fant 1960) the location of the main constric-
tion, the size of that constriction and the length
and the opening area of the lip section have
been shown to be important determinants of the
output formant pattern. Bearing such findings
in mind we included the parameters listed in
Table 2 in our analyses.
The measurements were made from tracings
of midsagittal profiles of the subject's vocal
tract. The lip section was described in terms of
three parameters: the vertical separation of the
lips, the location of the mouth corner relative to
the upper incisors and the degree of protrusion
of the lips relative to the mouth corner. Jaw
opening is defined as the position of the lower
incisors relative to their location in clench.
The tongue contours were specified in terms
of Principal Components derived from a sample
based on 411 tongue contours. The input to this
analysis was a matrix in which columns corres-
ponded to the 25 fleshpoints and the rows con-

Table 2. Articulatory parameters
Articulatory parameters Description/calculation
Vertical separation of lips Midsagittal distance
Jaw opening IncInf rel to clench
Lip length (Ulip+Llip)/2 IPC
Location mouth corner IPC-IncSup
Tongue contour 2 Principal Components
Larynx height Vertical distance IncSup-Lx
Pharynx back wall I slope
Pharynx back wall II intercept

tained information identifying individual ton-
gue contours. Since the specification of each
fleshpoint requires two numbers (x & y), there
were twice as many rows as contours. Accor-
dingly, the data fed into the PCA was an 822-
by-25 matrix. This format had the convenience
of automatically sorting the PCA output into
two sets: one consisting of the horizontal
weights (for the x coordinates) and the other
containing the vertical weights (for the y
coordinates). In the present multi-regressions
we limit the description of the tongue to four
numbers derived from the horizontal and the
vertical sets of the first two Principal Compo-
nents.
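A rough sketch of this PCA step is given below (Python; not the code actually used). The contour matrix is a random stand-in for the 411 traced tongue shapes, and only the layout of the 822-by-25 matrix and the extraction of the four tongue numbers follow the description above.

import numpy as np

def pca_weights(data, n_components=2):
    # per-row weights (scores) on the first n_components principal components, via SVD
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
contours_x = rng.normal(size=(411, 25))  # stand-in x-coordinates of 25 fleshpoints per contour
contours_y = rng.normal(size=(411, 25))  # stand-in y-coordinates
matrix_822_by_25 = np.vstack([contours_x, contours_y])

w = pca_weights(matrix_822_by_25)        # shape (822, 2)
horizontal_weights = w[:411]             # weights for the x-rows
vertical_weights = w[411:]               # weights for the y-rows
tongue_parameters = np.hstack([horizontal_weights, vertical_weights])  # four numbers per contour
print(tongue_parameters.shape)           # (411, 4)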
Larynx height was measured as the vertical
distance between the horizontal plane through
the tip of the upper incisor and the horizontal
plane through the posteriormost point of the
larynx contour.
Since movements of the back wall of the
pharynx would affect the posterior vocal tract
volumes and there was evidence of some pha-
ryngeal movement we approximated the back
wall with a straight line and used its slope and
intercept.
Analysis
Manual pulse-by-pulse measurements of the
first three formants were performed in
Soundswell 4.00.30 (Hitech Development AB),
using the Spectrogram Tool with FFT-points of
128/512, bandwidth of 250 Hz, Hanning win-
dow of 8 ms. The synchronization between the
acoustic data and the x-ray film was done nu-
merically by finding the best line-up between
the time points of acoustic segment boundaries
and the synch pulses corresponding to the indi-
vidual images. The error of this procedure is
estimated at a few msec.
Lastly, a multiple regression analysis
(MRA) was performed (using the Excel Analysis
ToolPak) on the acoustic and the articulatory
data, F1 data serving as the dependent variable
and the several articulatory parameters as the
independent variables.
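The study used the Excel Analysis ToolPak for this step; the sketch below shows an assumed Python equivalent (ordinary least squares on random stand-in data), with F1 as the dependent variable and parameters like those of Table 2 as independent variables.

import numpy as np

rng = np.random.default_rng(1)
n_frames = 200
X = rng.normal(size=(n_frames, 10))  # stand-in articulatory parameters (jaw, lips, tongue PCs, ...)
f1 = 500 + X @ rng.normal(size=10) * 40 + rng.normal(scale=20, size=n_frames)  # synthetic F1 (Hz)

design = np.column_stack([np.ones(n_frames), X])   # add an intercept column
coefs, *_ = np.linalg.lstsq(design, f1, rcond=None)

predicted = design @ coefs
ss_res = np.sum((f1 - predicted) ** 2)
ss_tot = np.sum((f1 - f1.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot                    # comparable to the r-squared values in Results
print(round(r_squared, 3))

Non-linear terms can be included simply by adding further columns (e.g. a squared jaw opening) to the design matrix, which is what makes MRA not limited to linear relationships between the articulatory parameters and the formants.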
Results
In Fig.3 frequency measurements for the first
formant are shown plotted against jaw move-
ment data for all words, revealing a poor corre-
lation between the acoustics and the articula-
tion, thus seemingly confirming earlier findings
about the non-monotone nature of the articula-
tory-acoustic relationship.












Figure 3. F1 vs. JAW, all words.
Examination of single words is helpful for
understanding some of the sources of the poor
correlation.















Figure 4. Time variations for F1 and JAW in
/b(pk)/. F1 is seen to be independent of jaw po-
sition during the stop closure, which has a lowering
effect on F1 but leaves the jaw trace at a more or
less fixed value.
In Fig.4 the time variations of F1 and JAW
are shown for /b(pk)/. In the occlusion
(bordered by red vertical lines), we find a clear
discrepancy between acoustics and articulation:
F1 takes a dive while the JAW shows hardly
any movement at all.
Figure 5. Time variations for F1 and JAW (left col-
umn: /da(s)/; right column: /(lv)/). Here is a
second example of F1's independence of the jaw. In
both words the jaw is raised without any major ef-
fect on F1.

Fig.5 gives a closer view of the anticipation of
the final consonant. Here F1 stays put, while
the JAW exhibits a steep rise, presumably in
preparation for the following consonants, /s/
and /l/ respectively. We know that consonant jaw
positions tend to be high compared to those of
vowels. In particular the articulation of /s/ de-
mands a high and steady jaw (Keating et al
1994).













Figure 6. Observed F1 vs. MRA-predicted F1

These examples help us understand why jaw
position alone is a poor predictor of F1 fre-
quency. They also suggest the use of MRA.
Fig.6 compares predictions based on the jaw
alone (gray symbols) with MRA results (solid
circles). The left diagram shows how MRA im-
proves the correlation score for b-words. The
improvement (from r² = 0.17 to r² = 0.73) is mainly
due to the fact that the drastic F1 lowering gets
linked to the lips reaching closure (cf. Fig. 4).
The right diagram illustrates the corresponding
results for g-words. Here the improvement
(from r² = 0.31 to r² = 0.83) occurs because, despite
the decrease in jaw opening, F1 can remain
high in the context of tongue blade elevation (cf.
Fig. 5). How well are formants predicted by the
present shortcut method? It is too early to give a
final answer since our results are limited to pre-
liminary observations of F1. Suffice it to say
that with carefully motivated selection of arti-
culatory information further improvements
seem possible.
References
Branderud P, Lundberg J-J, Lander J, Djam-
shidpey H, Wneland I, Krull D &
Lindblom B (1998): X-ray analysis of
speech: Methodological aspects, Proceed-
ings of FONETIK 98 (Dept of Linguistics,
SU) 168-171.
Branderud P (2002): Papex software, Dept of
Linguistics, Stockholm University.
Cortes E. (forthc.) Mapping articulatory para-
meters on the formant pattern. Dept of Lin-
guistics, Stockholm University.
Ericsdotter C (2005): Articulatory-Acoustic Re-
lationships in Swedish Vowel Sounds, Ph D
diss, Department of Linguistics, Stockholm
University.
Fant G. (1960) Acoustic Theory of Speech Pro-
duction. Mouton: The Hague.
Heinz J. M & Stevens K N (1964): On the De-
rivation of Area Functions and Acoustic
Spectra from Cinéradiographic Films of
Speech, 67th ASA meeting: 1037-1038.
Keating, P.A., Lindblom, B., Lubker, J., and
Kreiman, J. (1994) Variability in jaw
height for segments in English and Swedish
VCVs, J Phonetics 22:407-422.
Lindblom B & Sundberg J (1971): Acoustical
consequences of lip, tongue, jaw and larynx
movement, J Acoust Soc Am 50:1166-
1179.
Lindblom B (2003): A numerical model of
coarticulation based on a Principal Compo-
nent analysis of tongue shapes, Proc XVth
ICPhS, Barcelona.
Stevens K N (1998): Acoustic Phonetics, MIT
Press.
Stark J, Lindblom B & Sundberg J (1996):
APEX – an articulatory synthesis model for
experimental and computational studies of
speech production, TMH-QPSR, 37(2):45-
48.
(Re)use of place features in voiced stop systems:
Role of phonetic constraints
Björn Lindblom¹, Randy Diehl³, Sang-Hoon Park³ and Giampiero Salvi²
¹ Dept of Linguistics, Stockholm University, SE-10691 Stockholm
² KTH, Dept of Speech Music and Hearing, SE-10044 Stockholm
³ Dept of Psychology, University of Texas at Austin, Austin, Texas 78712, USA


Abstract
Computational experiments focused on place of
articulation in voiced stops were designed to
generate optimal inventories of CV syllables
from a larger set of possible CV:s in the pres-
ence of independently and numerically defined
articulatory, perceptual and developmental
constraints. Across vowel contexts the most sa-
lient places were retroflex, palatal and uvular.
This was evident from acoustic measurements
and perceptual data. Simulation results using
the criterion of perceptual contrast alone failed
to produce systems with the typologically widely
attested set [b] [d] [g], whereas using articu-
latory cost as the sole criterion produced in-
ventories in which bilabial, dental/alveolar and
velar onsets formed the core. Neither perceptual
contrast, nor articulatory cost, (nor the two
combined), produced a consistent re-use of
place features (phonemic coding). Only sys-
tems constrained by target learning exhibited
a strong recombination of place features.
Introduction
The simulations were aimed at modeling the use
and re-use of place features in voiced stop in-
ventories. We addressed two issues: First what
explains the predominance of labial [b], den-
tal/alveolar [d] and velar [g] in the worlds lan-
guages (Maddieson 1984)? Second why do all
languages systematically re-use phonetic at-
tributes (Clements 2003)? In other words, why
are phonetic forms phonemically coded?
This work is an extension of Liljencrants &
Lindblom's model of vowel systems (1972).
The program developed by Giampiero Salvi
systematically selects subsets of CV sequences
from a larger universal inventory of CV syl-
lables. It evaluates all possible systems in terms
of an optimization criterion. This criterion
quantifies how distinctive the syllables in the
subset are (perceptual contrast), how difficult
they are to produce (articulatory cost) and
how difficult they are to learn (learning cost).
An optimal subset is identified as the one that
minimizes the sum of the subset's cost/contrast
scores. The articulatory cost metric was devel-
oped from bio-mechanical measures of the ar-
ticulatory representations of the syllables. De-
gree of perceptual contrast was defined on the
basis of experimental data on listener confusions
and distance judgments (Park 2007). Two in-
terpretations of the end state of phonetic learn-
ing were studied: The first assumes that the ob-
ject of phonetic learning is to acquire dynamic
units (gestures, as in articulatory phonology);
the other hypothesizes that the learning of
phonetic movements involves the mastery of
timeless via points (targets) and the mecha-
nism of motor equivalence (ability to reach
motor goals from arbitrary initial conditions;
Lashley 1951).
Place and perceptual contrast
To provide empirical and independent motiva-
tion for assumptions used in the simulations
several subprojects were undertaken.
Park (2007) investigated the problem of de-
fining perceptual contrast of voiced stops. A
phonetically trained speaker produced 35 CV
syllables in which the place of the stop was var-
ied in seven steps ranging from bilabial, dental,
alveolar, retroflex, palatal, velar to uvular. The
vowel was [i] [e] [a] [o] or [u].
Perceptual judgments were collected from
four groups of phonetically naïve subjects
whose native languages are English, Hindi,
Korean and Mexican Spanish. There were five
subjects per language group. Their task was to
identify the syllables which were presented at
four signal/noise conditions (no noise, +5, 0, -5
dB). The effect of native language was weak or
absent which motivated the use of pooled data.
Confusion matrices tested negatively for re-
sponse bias effects. The matrix with the pooled
data was symmetrized using a method due to
Shepard (1972).
On the basis of acoustic measurements and
listener responses acoustic and perceptual dis-
tance matrices were derived. In a comparison
of several acoustic measures of distance with the
perceptual distances derived from the confu-
sions it was found that the acoustic variable with
the strongest predictive power was the for-
mant-based distances. Including also burst and
formant rate distances improved correlations
further (Park 2007).
Place and articulatory cost
From the viewpoint of biomechanics the ar-
ticulatory cost of moving between two arbitrary
configurations a and b, in a fixed time interval,
should be strongly related to the distance be-
tween them. This reasoning led us to quantify
the cost of a CV syllable as follows:
A(C,V) = [dist(rest,C)]² + [dist(C,V)]²    (1)
This formula says that A(C,V), the cost of a
given CV, is the sum of the consonant onset's
displacement from rest plus the vowel end-
point's displacement from the onset. The use of
squared values is intended to reflect the physio-
logical fact that the relationship between muscle
length and muscle force is non-linear (cf
force-length diagrams). This measure was ap-
plied to articulatory representations of the 35
syllables.
Lacking direct articulatory measurements of
the recorded CV syllables, we used a subset of
about 500 tracings of X-ray profile images of a
single Swedish speaker available from other
projects (Branderud et al 1998). This corpus was
searched for representative vowels and strongly
constricted configurations at points of articula-
tion ranging from dental to uvular. Data on the
rest position were also included. It was defined
in terms of the articulatory configuration
adopted during quiet breathing. The final selec-
tion consisted of images of [i e a o u] sampled in
stressed syllables near the vowel midpoints and
configurations representing dental, alveolar,
retroflex, palatal, velar and uvular closures. To
facilitate comparisons between the contours,
they were resampled at 25 equidistant flesh-
points.
In applying Equation (1) to these represen-
tations, articulatory distances, dist(rest,C) and
dist(C,V), were computed as the
root-mean-square of the inter-fleshpoint dis-
tances.
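A minimal sketch of this cost computation (Python, with invented contours standing in for the X-ray tracings):

import numpy as np

def rms_fleshpoint_distance(contour_a, contour_b):
    # RMS Euclidean distance between corresponding fleshpoints of two (25, 2) contours
    diffs = contour_a - contour_b
    return np.sqrt(np.mean(np.sum(diffs ** 2, axis=1)))

def articulatory_cost(rest, consonant, vowel):
    # Eq. (1): squared displacement from rest to C plus squared displacement from C to V
    return rms_fleshpoint_distance(rest, consonant) ** 2 + rms_fleshpoint_distance(consonant, vowel) ** 2

rng = np.random.default_rng(2)
rest_contour = rng.normal(size=(25, 2))  # stand-in for the quiet-breathing configuration
dental_closure = rest_contour + rng.normal(scale=0.5, size=(25, 2))
vowel_contour = rest_contour + rng.normal(scale=2.0, size=(25, 2))

print(articulatory_cost(rest_contour, dental_closure, vowel_contour))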
The main findings were: (i) The proposed
cost measure ranks places with respect to in-
creasing cost as follows: bilabial, dental, velar,
alveolar, palatal, uvular & retroflex; (ii) It
captures the notion of assimilation, success-
fully pairing front vowels with anterior conso-
nant onsets and back vowels with posterior
consonant onsets. The first finding is related in
part to defining the cost measure as deviation
from rest, in part to identifying rest with the
articulatory settings of quiet breathing: a raised
jaw; closed lips; a fronted tongue creating a
more open posterior vocal tract facilitating
breathing through the nose. The second result is
linked to the use of Eq (1).
Although the cost measure is a first ap-
proximation, the predicted preferences show
good agreement with typological data on the use
of place in stops. The world's languages use 17
target regions from lips to glottis (Ladefoged &
Maddieson 1996). Irrespective of inventory size,
nearly all (over 99%) of UPSID's 317 languages
(Maddieson 1984) recruit three places of ar-
ticulation: bilabial, dental/alveolar and velar in
stops.
The findings are also in good agreement with
observations of infant speech production which
show a strong tendency for alveolar closures to
co-occur with front vowels, velar with back
vowels and labial with central vowels (Davis &
MacNeilage 1995).
Gestural or target-based control?
Is adult speech production gesture- or tar-
get-based? We argue that taking a stand on this
issue also implies taking a stand on what the end
state of phonetic learning is. Do children learn
gestures or targets? The simulations were set up
to reflect those two possibilities.
In recent times the traditional target theory
of speech has not gone unchallenged. Support
for dynamic gestures as basic units comes from
experimental data indicating that visual and
auditory systems are more sensitive to changing
stimulus patterns than to purely static ones.
There is also evidence from speech perception
experiments (Strange 1989).
The problem that such observations pose for
a target theory of speech is that, if perception
likes change, why assume that the control of
speech production is based on static targets?
Should not what a talker controls in production
be what the listener wants in perception?
We argue that the fact that dynamic proper-
ties of speech are important in perception should
not in any way rule out the possibility that
speakers might use a sparse representation of
speech movements. While rejecting a target
theory of speech perception appears justified,
dismissing a target theory of speech production
appears premature.
This conclusion becomes clear when the
formal definition of gesture is examined. The
standard reference is to the work by Saltzman &
Munhall (1989). Their task-dynamic model is
often described as using an input of gestural
primitives. However, the fact that gestures are
formally defined in terms of point attractors
reveals that the notion of target is actually part
of their technical definition.
Targets and phonetic learning
We conclude from these considerations that
there is significant support for assuming that
adult speech processes are target-based and are
set up to generate motor equivalent behavior. In
other words, within its work space, the system
generates the movement between A (an arbitrary
current location) and B (an arbitrary movement
goal) and does so for situations requiring new
compensatory motion paths.
If these processes are part of the adult
speaker's phonetic competence, they must
somehow be acquired by the learner. We sug-
gest (i) that in development targets are the re-
sidual (least action) end products of matching
the response characteristics of the speech effectors
to the dynamics of the ambient speech; and (ii)
that the movement paths (transitions) between
targets are handled by the general mechanism of
motor equivalence.
These assumptions lead to the following
corollary: Once a target has been learned in
one context, it can immediately be re-used in
other contexts, since motor equivalence han-
dles the trajectory for the new context.
The above set of hypotheses will be referred
to as target learning. A form of gestural learn-
ing will also be included in the simulations. It
will be interpreted to mean acquiring gestures
holistically.
Computational experiments
We here consider the set of possible CV:s to
consist of 35 syllables although languages in
principle have an uncountable number to choose
from. The goal of the simulations is to investi-
gate subsets of the 35 CV items by ranking them
according to an optimality criterion.
Building on Liljencrants & Lindblom's
model, we developed the criterion with the fol-
lowing components:
perceptual contrast D(S) is a global measure of
perceptual dissimilarity based on the pairwise
dissimilarity D(i,j) of any syllables s_i and s_j be-
longing to the system S;
articulatory cost A(i) is the cost of each syllable
s_i belonging to S;
learnability is a measure of the effort required to
learn system S. It is based on the number of
consonant onsets w and vowel endpoints z that
the syllables belonging to system S share.
The criterion to optimize is:
score(S) = [1/D(S)]² · Σ_{i=1..N} A(i)/r(i)    (2)
1/D(S)²
corresponds to the definition of contrast
used by Liljencrants & Lindblom. When con-
trast does not contribute this term is equal to 1.
Learnability r(i) can assume the forms: r(i) = 1
for gestural learning, or r(i) = wz for target
learning.
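The sketch below (Python) illustrates how a criterion of this form can be evaluated over candidate systems and minimized by exhaustive search. The tiny syllable inventory, the dissimilarity and cost tables, and the reading of w and z as counts of syllables in S sharing an onset or an endpoint are illustrative assumptions, not the values or the program used in the simulations.

from itertools import combinations

SYLLABLES = ["ba", "da", "ga", "bi", "di", "gi"]
A = {s: 1.0 + 0.1 * i for i, s in enumerate(SYLLABLES)}           # stand-in articulatory costs A(i)
D_PAIR = {frozenset(p): 1.0 for p in combinations(SYLLABLES, 2)}  # stand-in pairwise dissimilarities D(i,j)

def global_contrast(system):
    # D(S): here simply the summed pairwise dissimilarity within the system
    return sum(D_PAIR[frozenset(p)] for p in combinations(system, 2))

def learnability(syllable, system, target_learning=True):
    # r(i): 1 for gestural learning; w*z for target learning
    if not target_learning:
        return 1
    w = sum(1 for s in system if s[0] == syllable[0])  # syllables sharing the consonant onset
    z = sum(1 for s in system if s[1] == syllable[1])  # syllables sharing the vowel endpoint
    return w * z

def score(system, target_learning=True):
    # Eq. (2): (1/D(S))^2 * sum_i A(i)/r(i); the optimal system minimizes this score
    return (1.0 / global_contrast(system)) ** 2 * sum(
        A[s] / learnability(s, system, target_learning) for s in system)

best_system = min(combinations(SYLLABLES, 3), key=score)
print(best_system)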
The rationale for adding the r(i) term is as
follows: The child's attempt to imitate and
spontaneously use a given phonetic form comes
up against dealing with the articulatory com-
plexity of that form. As imitation attempts are
repeated sensory references are established.
When a given sensory experience is recorded it
automatically gets linked to a motor reference
(assuming that the learner has a neural mirror
system, that is, a perceptual/motor link). With
more practice this motor reference is strength-
ened. Accordingly during the course of the
learning a sort of copying takes place. How-
ever, it is copying only in a non-trivial sense
since some patterns are easier than others (read:
they differ in terms of articulatory cost A(i)). So
speed of acquisition is affected by that cost. The
gestural and target-based approaches influence
that speed in different ways. In the gestural
mode forms are acquired at a rate inversely re-
lated to articulatory complexity. More com-
plexity means more practice. Target learning
modifies that rule. Because targets once learned
in one context can be re-used in new contexts,
practice will modify the score of all syllables
containing that target. Target learning therefore
implies more rapid learning than gestural
learning. The term r(i) controls that speeding up
by measuring how many times a given system
re-uses a given target. By definition target in-
formation is stored independently of context.
When a given target is practiced all potential
combinations using it will benefit. It appears
reasonable to assume that this would also be true
of flesh-and-blood phonetic learning. The key to
the re-use phenomenon is motor equivalence
and the context-free nature of targets.
Results
Across vowel contexts the most salient places
were retroflex, palatal and uvular. This was
evident from acoustic measurements and per-
ceptual data. Perceptual contrast alone reflected
that fact in failing to favor systems with the ty-
pologically widely attested set [b] [d] [g].
On the other hand, using articulatory cost as
the sole criterion produced inventories in which
bilabial, dental/alveolar and velar onsets formed
the core.
Neither perceptual contrast, nor articulatory
cost, (nor the two combined), produced a con-
sistent re-use of place features (phonemic cod-
ing). Only systems constrained by target
learning exhibited a strong recombination of
place features.
Implications
A comprehensive discussion of the present
findings is found in Lindblom et al (in press).
Our research supports explaining the typologi-
cal preference for labial, dental/alveolar and
velar in terms of a theoretically motivated
measure of ease of articulation. It further sug-
gests that phonemic coding may also have ar-
ticulatory origins (the interplay between discrete
motor target representations and motor equiva-
lence) to which languages have adapted during
the course of history. The results do not throw
doubts on perceptual contrast playing a role in
shaping sound systems. Rather they suggest
phonetic factors operating in interaction. Per-
haps the most novel aspect of the work is the fact
that phonetic explanations were proposed not
only for substantive aspects (place preferences)
but also for formal facts such as the recombina-
tion of place features (phonemic coding). Here
the remarks of Martinet (1968:483) come to
mind: "In so far as such combinations are easy
to realize and to identify aurally, they should be
a definite asset for a system: for the same total
of phonemes, they require less articulations to
keep distinct; these articulations, being less
numerous, will be the more distinct; each of
them being more frequent in speech, speakers
will have more occasions to perceive and pro-
duce them, and they will get anchored sooner in
the speech of children."
References
Branderud P, Lundberg H-J, Lander J,
Djamshidpey H, Wneland I, Krull D &
Lindblom B (1998): "X-ray analyses of
speech: Methodological aspects", in
FONETIK 1998, KTH, Stockholm.
Clements G N (2003): Feature economy in
sound systems, Phonology 20:287-333.
Davis B L & MacNeilage P F (1995): The ar-
ticulatory basis of babbling, J Speech Hear
Res 38:1199-1211.
Ladefoged P & Maddieson I (1996): The sounds
of the worlds languages, Oxford:Blackwell.
Lashley K (1951): The problem of serial order
in behavior, pp 112-136 in Jeffress L A
(ed): Cerebral mechanisms in behavior,
Wiley:New York.
Liljencrants J & Lindblom B (1972): Numeri-
cal simulation of vowel quality systems: The
role of perceptual contrast, Language
48:839-862.
Lindblom B, Diehl R, Park S-H & Salvi G (in
press): Sound systems are shaped by their
users: The recombination of phonetic sub-
stance, to appear in Nick Clements & Ra-
chid Ridouane (eds): Where do features
come from? The nature and sources of pho-
nological primitives.
Maddieson I (1984): Patterns of sound, Cam-
bridge:CUP.
Martinet A (1968): Phonetics and linguistic
evolution, in Malmberg B (ed): Manual of
phonetics, 464-487, Amster-
dam:North-Holland.
Park S-H (2007): Quantifying perceptual con-
trast: The dimension of place of articulation,
Ph D dissertation, University of Texas at
Austin.
Saltzman E L & Munhall K G (1989): A dy-
namical approach to gestural patterning in
speech production, Ecological Psychology
1(4):333-382.
Shepard R N (1972): Psychological represen-
tation of speech sounds, in P.B. Denes & E.
E. David Jr. (eds.) Human communication: A
unified view, 67-113, New York,
McGraw-Hill.
Strange W (1989): Evolving theories of vowel
perception. J Acoust Soc Am 85(5):
2081-2087.
On the Non-uniqueness of Acoustic-to-Articulatory
Mapping
Daniel Neiberg and G. Ananthakrishnan
Centre for Speech Technology, CSC, KTH (Royal Institute of Technology), Stockholm, Sweden

Abstract
This paper studies the hypothesis that the
acoustic-to-articulatory mapping is non-
unique, statistically. The distributions of the
acoustic and articulatory spaces are obtained
by minimizing the BIC while fitting the data
into a GMM using the EM algorithm. The kur-
tosis is used to measure the non-Gaussianity of
the distributions and the Bhattacharya distance
is used to find the difference between distribu-
tions of the acoustic vectors producing non-
unique articulator configurations. It is found
that stop consonants and alveolar fricatives are
generally not only non-linear but also non-
unique, while dental fricatives are found to be
highly non-linear but fairly unique.
Introduction
The acoustic-to-articulatory (A-to-A) mapping,
also known as articulatory inversion has been
of special interest to speech researchers for
quite some time. It deals with estimating or re-
covering vocal tract shapes from the acoustics
of an utterance. It remains one of the funda-
mental problems in understanding speech pro-
duction. The inversion has several applications,
namely low bit-rate encoding, training visual
agents like talking heads, improving articula-
tory speech synthesis, and robust speech recog-
nition. Research in this topic has shown sub-
stantial progress in terms of using several ma-
chine learning techniques to minimize the error
between the true vocal tract shape and the
shape estimated using knowledge of the acous-
tics. Ouni et al. (2005) and Roweis (1997) have
performed inversion using code books and dy-
namic programming while Richmond (2006)
has done extensive research on performing the
mapping using mixture density Neural Network
(NN) regression. Hiroya (2004) and
Katsamanis et al. (2007) have used Hidden
Markov Models (HMM) with one or more
states per phoneme, in order to perform the in-
version. Toda et al. (2008) have used a Gaus-
sian mixture model (GMM) along with Maxi-
mum Likelihood Estimation (MLE) smoothing
for dynamic features. These methods have been
extremely successful at minimizing the average
root mean square error (RMSE) over all articu-
latory channels and maximizing the Pearson's
Correlation Coefficient (PCC) between the ac-
tual vocal tract configuration and the predicted
ones.
The errors in the mapping are often attrib-
uted to the non-uniqueness of the inversion
known as "fibers" in the articulatory space. By
non-uniqueness, it is meant that multiple vocal
tract configurations can produce almost the
same acoustic features. Early research pre-
sented some interesting evidence corroborating
non-uniqueness. Bite-block experiments
showed that the speakers were capable of pro-
ducing sounds perceptually close to the in-
tended sounds even though the jaw was fixed in
an unnatural position. As shown by Gay and
Lindblom (1981) the lossless tube models of
the vocal tract also indicate the possibility of
this non-uniqueness. Qin et al. (2007) per-
formed, possibly, the first empirical investiga-
tion into the non-uniqueness problem. They
quantized the acoustic space using the percep-
tual Itakura distance on LPC features. The ar-
ticulatory space was modeled using a non-
parametric Gaussian density kernel with a fixed
variance. For the phonemes where the articula-
tory distribution was found to be multi-modal,
the authors had concluded that non-uniqueness
existed. They found non-uniqueness for certain
phonemes like //, /l/ and /w/, while there
seemed to be a unique mapping for certain
other phonemes.
By definition, a mapping is said to be non-
unique if more than one articulatory position
can produce exactly the same acoustic features.
In real continuous speech, however, the possi-
bility of finding two data points with exactly
the same acoustic parameters is abysmally
poor. In order to simplify the problem, the
acoustic space is quantized. If two data points
within this quantization range fall sufficiently
apart in the articulatory space, then the map-
ping is said to be non-unique. However, a result
obtained in this manner can be quite mislead-
ing, since this kind of an effect could be caused
by insufficient resolution of quantization

and data sparseness. In an attempt to provide
answers to these issues a new model based
paradigm is proposed, which tries to quantify
non-uniqueness. In the following sections, a
study of the nonlinearity and non-uniqueness of
the A-to-A mapping is presented.
Data
The experiments conducted use the Electro-
magnetic Articulography (EMA) data from the
MOCHA database (Online:
http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html,
accessed on 23 Jan. 2007) for a
male and female speaker. The acoustic features
are (D =16) MFCC from the acoustic input and
the articulatory features are positions of the
EMA coils. The x- and y-positions of 7 coils
are available, which means that there are a total
of d =14 articulator channels.
Method
The proposed method is illustrated in Figure 1.
It is based on unsupervised clustering of the
data points in acoustic space (X) for each pho-
neme, which partitions the acoustic data into dis-
tinct Gaussian clusters. Then the data points in
articulatory space (Y) corresponding to each
acoustic subspace (modeled by a Gaussian), are
clustered again using the same technique. If the
clustering in the articulatory space generates
multiple modes, then it is a sign of non-linear
mapping. If the data points corresponding to
different modes in the articulatory subspaces
are all sampled uniformly from the same Gaus-
sian in the acoustic subspace, then it is a sign of
non-unique mapping. The details of this algo-
rithm are given below.
The clustering procedure uses a model
based approach which fits the given data into a
GMM, in such a way that every Gaussian
represents a cluster in the acoustic space. This
is achieved using the Expectation Maximiza-
tion (EM) algorithm to obtain parameters that
are Maximum Likelihood (ML) estimates
(McLachlan, 2000). The Schwarz or Bayesian
Information Criterion (BIC) is minimized, in
order to find the optimum number of clusters in
the acoustic space. Every phoneme p in the cor-
pus is modeled by a Gaussian mixture model
containing K_p clusters. K_p is chosen by
minimizing the BIC.
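A sketch of this clustering step, assuming scikit-learn (which the paper does not mention) and random stand-in data:

import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm(data, max_components=8):
    # fit GMMs with 1..max_components components via EM and keep the one minimizing BIC
    k_max = min(max_components, len(data))
    candidates = [GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(data)
                  for k in range(1, k_max + 1)]
    return min(candidates, key=lambda g: g.bic(data))

rng = np.random.default_rng(3)
acoustic = rng.normal(size=(500, 16))      # stand-in MFCC frames for one phoneme (D = 16)
articulatory = rng.normal(size=(500, 14))  # stand-in EMA coil positions (d = 14)

acoustic_gmm = best_gmm(acoustic)
labels = acoustic_gmm.predict(acoustic)
for k in range(acoustic_gmm.n_components):
    # articulatory subspace belonging to the k-th acoustic Gaussian
    articulatory_gmm = best_gmm(articulatory[labels == k])
    print(k, articulatory_gmm.n_components)  # more than one component signals a non-linear mapping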
Non-linearity
For the data points belonging to the k-th acoustic
Gaussian, X_k^p, the corresponding articulatory
subspace is modeled by an optimal number of
Gaussians, R_k^p, using the same method. The ar-
ticulatory vectors belonging to the r-th such
Gaussian are given by Y_k^p(r).
If there exists only one Gaussian in the ar-
ticulatory space for the k-th Gaussian in the
acoustic space (i.e. R_k^p = 1), then the distribu-
tion of the articulatory vectors can be predicted
by performing a linear transform on the Gaus-
sian distribution of the acoustic vectors. How-
ever, if R_k^p > 1, then it means that this sort of a
linear transform cannot be performed. The less
normal the articulatory space is, the more non-
linear the mapping. The non-linearity of Y_k^p is
calculated by using Mardia's multi-variate
kurtosis (Mardia, 1970) for goodness of fit to a
normal distribution. This measure NL (Non-
Linearity) is the proposed measure of non-
linearity. It takes the value of 0 for a true Gaus-
sian distribution and a positive higher value for
a non-Gaussian distribution. It is important to
note here that observing multi-modality in the
distribution of the articulatory features corre-
sponding to a single mode Gaussian of acoustic
features does not necessarily imply non-
uniqueness, but it necessarily implies non-
linearity. The authors want to stress this point
here, because it is easy to confuse the multi-
modality with non-uniqueness. A more strin-
gent measure is necessary to imply non-
uniqueness.
Figure 1: Three hypothetical examples of sub-
space mapping between the acoustic space (X) and
the articulatory space (Y): 1) the mapping is linear;
2) the mapping is non-linear and unique; 3) the
mapping is non-linear and non-unique.

Non-uniqueness
Consider the Gaussian acoustic space X_k^p.
X_k^p(r) is a subset which corresponds to one
mode of the articulatory space, Y_k^p(r). There are
two possibilities. The first possibility is that
this subset does not have a Gaussian distribu-
tion. In such a case, it may be possible to find a
non-linear mapping between each of these sets
of data to the corresponding mode in the articu-
latory space. But if this part has the same Gaus-
sian distribution as the whole single Gaussian,
i.e., if the distributions of X_k^p(r) and X_k^p have
exactly the same parameters, then it connotes
that the data points with exactly the same dis-
tribution in the acoustic space can actually pro-
duce articulatory features with different distri-
butions. This is the necessary and sufficient
condition to imply non-uniqueness. In order to
find out the similarity between the distributions
of X_k^p(r) and X_k^p, the Bhattacharya distance
is used. However, there is no accurate method
of calculating this distance for unknown distri-
butions. Non-parametric distribution estimates
would suffer from a data sparseness problem.
So, we use the Bhattacharya distance assuming
a Gaussian distribution but weigh it by the
Gaussianity of the data. Non-Gaussianity is de-
termined by the kurtosis of the data points. We
bias it so that it takes the value of 1 for a per-
fect Gaussian distribution and is higher for non-
Gaussians. Thus non-uniqueness, NU_k^p(r), can
be defined as the inverse of the Bhattacharya
distance (D_bh) weighted by the measure of its
Gaussian nature:

NU_k^p(r) = 1 / [K_m(X_k^p(r)) · (1 + D_bh(X_k^p(r), X_k^p))]

K_m(.) denotes the multi-variate kurtosis of the
data points. Thus, NU is lower for clusters with
unique mapping.
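A sketch (Python) of the two ingredients of NU, Mardia's multivariate kurtosis and the closed-form Bhattacharya distance between Gaussian fits to the two data sets; the exact way the kurtosis weight is biased towards 1 for a perfect Gaussian is an assumption of this sketch.

import numpy as np

def mardia_kurtosis_excess(x):
    # Mardia's multivariate kurtosis minus its expected value d(d+2) under normality
    x = np.asarray(x, dtype=float)
    _, d = x.shape
    centered = x - x.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(x, rowvar=False))
    mahal_sq = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.mean(mahal_sq ** 2) - d * (d + 2)

def bhattacharya_distance(x1, x2):
    # closed-form Bhattacharya distance between Gaussian fits (sample means and covariances)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    diff = m1 - m2
    term1 = diff @ np.linalg.pinv(c) @ diff / 8.0
    term2 = 0.5 * np.log(np.linalg.det(c) / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    return term1 + term2

def non_uniqueness(x_mode, x_cluster):
    # NU = 1 / (K_m(X(r)) * (1 + D_bh(X(r), X))); K_m biased to 1 for a perfect Gaussian (assumption)
    k_m = 1.0 + max(0.0, mardia_kurtosis_excess(x_mode))
    return 1.0 / (k_m * (1.0 + bhattacharya_distance(x_mode, x_cluster)))

rng = np.random.default_rng(4)
x_cluster = rng.normal(size=(400, 5))     # stand-in acoustic cluster X_k^p
x_mode = x_cluster[:200]                  # acoustic points behind one articulatory mode, X_k^p(r)
print(non_uniqueness(x_mode, x_cluster))  # near 1 here: same distribution, i.e. non-unique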
Results
Figure 2 shows an example of non-uniqueness for
the articulatory and acoustic subspaces for pho-
neme /t/. Data points belonging to different
clusters in the articulatory space seem to belong
to the same distribution in the acoustic space.
The most discriminating features in the acoustic
space are shown to the right, while the actual
articulatory positions are shown to the left. We
can see that, the same data points have different
distributions in the articulatory space and al-
most the same distribution in the acoustic
space. This is a sign of non-uniqueness. Figure
3 shows a plot of the articulatory and acoustic
sub-spaces for phoneme /l/. Data points from
the acoustic space on the whole seem to take a
Gaussian distribution. But they do not take a
Gaussian distribution in the articulatory space.
For every cluster in the articulatory space, the
acoustic distribution is a non-Gaussian in the
acoustic space. This shows that the mapping is
non-linear, but still rather unique. Figure 4
shows a comparative study of the non-linearity
(NL) and non-uniqueness (NU) for a few pho-
nemes in the database, for the male and female
speaker. The mean NL
p
and NP
p
for phoneme p
are calculated by weighing them with the
factor of the respective Gaussians. We
can see that most consonants exhibit non-
uniqueness while most vowels have a small de-
gree of non-linearity. The stop consonants /t/,
/d/, /k/ and /p/ and fricatives like /s/, /z/ have a
higher degree of non-uniqueness. The fricatives
/θ/ and /ð/ are highly non-linear but rather
unique, while liquids such as /l/, // and // are
found to be rather non-linear, but unique by the
method used in this paper. While the non-
linearity and non-uniqueness of stop conso-
nants is expected due to the silence region, the
reason why alveolar fricatives show high non-
uniqueness could be because the EMA data
may not have adequate detail to measure the
exact location of the tongue tip. There are con-
siderable variations in the levels of non-
uniqueness and non-linearity shown for the
male and female speakers, but the trends are
Figure 2: Plot showing the articulatory and
acoustic subspaces for phoneme /t/.
[Left panel: EMA coil positions (upper lip, lower lip, velum, tongue, jaw) in the articulatory space; right panel: the two most discriminating LDA dimensions of the acoustic space, with the two clusters marked.]
more or less similar. Figures 3 and 4 show cor-
responding data points in the articulatory and
the acoustic subspaces.
Conclusions and future work
This work proposes a method to distinguish be-
tween non-linearity and non-uniqueness. It suggests
measures to quantify the same, and analyzes the
non-linearity and non-uniqueness of different pho-
nemes in the database for two speakers. Phonemes
such as /t/ and /p/ are found to be non-unique, while
other phonemes like /θ/ and /ð/ are found to be non-
linear, but rather unique, from our studies. Future
work can be directed at trying to estimate the best
possible non-linear estimators for clusters with high
non-linearity and further constraints or allowances to
tackle the non-uniqueness of the mapping. Work
must be done on defining a non-uniqueness crite-
rion for any general distribution. Knowing which
articulators contribute more to the non-uniqueness
could be another direction for research. Studies with
other methods than EMA must be carried out to
validate the results obtained from this paper.
Acknowledgements
The authors acknowledge the financial support of
the Future and Emerging Technologies (FET) pro-
gramme within the Sixth Framework Programme
for Research of the European Commission, under
FET-Open contract no. 021324.
References
Ouni, S. and Laprie, Y. (2005) Modeling the
articulatory space using a hypercube code-
book for acoustic-to-articulatory inversion.
In J. Acoust. Soc. Am., 118(1):444-460.
Roweis, S. (1997) Towards articulatory speech
recognition: Learning smooth maps to re-
cover articulator information, Eurospeech
3:1227-1230.
Richmond, K. (2006) A Trajectory Mixture
Density Network for the Acoustic-
Articulatory Inversion Mapping. In Proc.
ICSLP, 577-580, Pittsburgh.
Hiroya, S. (2004) Estimation of Articulatory
Movements from Speech Acoustics using
an HMM-based Speech Production Model.
In Trans. IEEE on Speech and Audio Proc-
essing, 12(2):175-185.
Katsamanis, A., Papandreou, G., Maragos, P.
(2007) Audio-visual to Articulatory Speech
Inversion using HMMs. In Proc. Multime-
dia Signal Processing, 457-460.
Toda, T., Black, A. W., Tokuda, K. (2008) Sta-
tistical mapping between articulatory move-
ments and acoustic spectrum using a Gaus-
sian mixture model. In J. Speech Communi-
cation 50:215-227.
Gay, T.; Lindblom, B.; Lubker, J. (1981) Pro-
duction of bite-block vowels: Acoustic
equivalence by selective compensation. In J.
Acoust. Soc. Am. 69: 802-810.
Qin, C., Carreira-Perpinan, M. A. (2007) An
Empirical Investigation of the Nonunique-
ness in the Acoustic-to-Articulatory Map-
ping. In Proc. Interspeech, 74-77, Antwerp.
McLachlan, G., and D. Peel (2000) Finite Mix-
ture Models. John Wiley and Sons, New
York.
Mardia, K. V. (1970) Measures of multivariate
skewness and kurtosis with applications,
Biometrika, 57(3):519-530.
Figure 3: Plot showing the articulatory and
acoustic subspaces for phoneme /l/.
[Left panel: EMA coil positions (upper lip, lower lip, velum, tongue, jaw) in the articulatory space; right panel: LDA dimensions 1 and 2 of the acoustic space.]
Figure 4: Graph showing the mean non-linearity
and non-uniqueness for selected phonemes in the
MOCHA database for the male and female
speaker.
Pronunciation in Swedish encyclopedias: phonetic
transcriptions and sound recordings
Michaël Stenberg
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
Abstract
This paper presents work in progress, aimed at
a doctoral dissertation on displaying pronunci-
ation to the users of general encyclopedias,
particularly those in the Swedish language. Vari-
ous types of phonetic notations are studied and
compared; also some pronunciation dictionar-
ies are taken into account. The problems of
finding an optimal way of presenting pronunci-
ation to users are scrutinized and discussed.
Furthermore, so-called audio pronunciations,
i.e. recordings of read words in digital encyclo-
pedias, are treated from several points of view.
Introduction
When consulting encyclopedias, getting hold of
the pronunciation of an entry can be quite a
tricky affair, because of the multitude of phon-
etic notations being used in various works.
Some of these, such as the renowned Encyc-
lopædia Britannica, published in the U.S.A., do
not submit pronunciation data at all.
Among Swedish general encyclopedias,
there is a long-established tradition of explain-
ing to users how to pronounce words esteemed
difficult. Customarily, this is done by way of
phonetic transcriptions; since the advent of dig-
ital media, also sound recordings, "audio pro-
nunciations", have been made use of.
Scope and method of study
This study, which is intended to lead to a Ph.D.
thesis, will focus on phonetic notation systems
used in encyclopedias, particularly in the Swedish
language. The systems will be compared with
each other and also with some notations used in
pronunciation dictionaries – not only Swedish
ones – and evaluated. So-called audio pronun-
ciations, which are becoming increasingly fre-
quent in digital reference works as a comple-
ment to phonetic transcriptions, will be studied
separately.
The survey method is qualitative; by means
of questionnaires, a panel of encyclopedia users
will be consulted about their attitudes, expecta-
tions and preferences with regard to display of
pronunciation.
An important issue will be that of optimiz-
ing pronunciation transcriptions. For someone
consulting a reference book, it takes a certain
effort to read up on the present transcription
system. This effort ought to be in proportion to
the benefit users get from it.
Other major issues that will be handled are
pronunciation editors' evaluation of sources of
information, their choice of language varieties,
"lects", to be transcribed/recorded and their de-
cisions about what phonetic features – at pro-
sodic as well as segmental level – should be
submitted in various types of works.
What pronunciation to display?
In the editing process of a reference book, an essential issue, from several aspects, is deciding what kind of pronunciation the transcriptions should be based on. For Swedish, a rough consensus seems to exist, although some phonemes, like /r/ and its combinations with following dentals, may give rise to controversies.
In the Swedish language community, a small one with a relatively high general level of education, people are expected to pronounce many loanwords and foreign proper names in a fairly source-language-like way.
For example, it would be stigmatizing to pronounce pommes frites or Colgate as if they were ordinary Swedish words: [pmsfrits], [klgt]. Rather, the latter ought to be pronounced in its conventional quasi-English way: [klgt], and the former in the French way: [pmfrit]. However, in recent decades, the ellipsis pommes, with the low-prestige pronunciation [pms], has emerged.
In Sweden, a person with an academic education who does not master at least one foreign language can hardly be imagined. Prior to the abolition of the studentexamen (roughly comparable to the French baccalauréat) in 1968, at least two foreign languages were compulsory in secondary schools. Overall, there is a social pressure that creates a demand for information about "correct" pronunciation in reference books and dictionaries.
Choice of lects of foreign languages;
native vs. adapted pronunciation
Apart from settling which variety (accent) of Swedish should be used as a base for transcriptions, the problem of handling languages with more than one major variety, e.g. English, Spanish and Portuguese, has to be dealt with. On the one hand, native speakers of these languages normally use their own pronunciation wherever they are and whatever their topic. On the other hand, when presenting the name of a living person, it is usually a matter of courtesy for encyclopedias to report his or her own preferred pronunciation. In general, imposing on the bearer of a name a pronunciation totally alien to him or her comes, like misspelling it, close to being rude.
But as time goes by, frequently mentioned names, even personal ones, usually undergo the same pronunciation changes as loanwords do. This is the case for Beethoven (see Table 1). Notably enough, J.S. Bach keeps his German pronunciation [bax] in Sweden, in spite of [x] having no phonemic status in Swedish, but has become [bak] in Denmark and [bk] or [bx] in the U.K. In such cases, publishing the adapted (swedicized) pronunciation in the first place seems to be a good rule of thumb.
Table 1. IPA transcriptions of Beethoven as habitually pronounced in some languages. In the Danish example, the apostrophe (') denotes stød.
German [bet!"#$%&
Swedish [b't()*+]
Danish [bet!",-./$&
British English [b01t2(34*+]
American English [b01t)4*+]
French [b0t)*+], [b0t*+]
Russian [b1tx)*+]
What notation to use?
&he transcriptions in Swedish general reference
books published since the end of the 19th cen-
tury are of many different kinds: some printed
worksparticularly older onesemploy only
letters of the Swedish alphabet, others add a
few special signs, and still others resort to a
more or less extensive IPA notation, not seldom
modified in some respects. In major Swedish
encyclopedias there is a tendency over the last
century to approach regular IPA, although a re-
luctance to accept the IPA way of marking
stress seems to remain. This may be due to a
solid tradition among monolingual Swedish
glossaries etc. to use an acute accent () to sim-
ultaneously indicate primary stress and quantity
(of vowels, in the first place). The acute accent
is not merely used in bracketed transcriptions,
but also in entry headwords. Since in Swedish,
quantity and stress are linked together, and
vowel and consonant length are in complement-
ary distribution, this system is economical and
operational as far as purely Swedish pronunci-
ation is concerned. The accent sign is placed
after the letter(s) representing the long sound,
e.g. kajak, konjak, pollett, thus eliminating
the problem of syllabification. However, when
it comes to rendering pronunciation of more
genuinely foreign words, the system proves to
be less suitable.
Table 2. Phonetic transcriptions of Fontainebleau
in the Swedish encyclopedias Nordisk Familjebok
(NF), Svensk Uppslagsbok (SvU), Nationalencyklo-
pedin (NE), Bertmarks Respons (BR) and Bonniers
Lexikon (BL). Years of publication in parentheses.
NF (1876–99) [f566t7+bl58]
NF (1904–26) [f)+gt7+9b:;8&
SvU (1947–55) [f5<t7+bl5=]
NE (1989–96) [#>t?$b:"@&
BR (1997–98) [#A6t?$b:"&
BL (1993–98) [#>t?$b:B&
Table 3. Phonetic transcriptions of Michelangelo
in the same works as in Table 2
NF (1876–99) [mik0la=+CD0l)]
NF (1904–26) [mik0la=+CD0l5]
SvU (1947–55) [mikela+=j el]
NE (1989–96) [mik0la=+CE0l)]
BR (1997–98) [mik0la+CE0l)]
BL (1993–98) [mik0l?+CE0l)]
In addition, Den Store Danske Encyklopædi (1994–2001) provides [mikelndlo], in its notation based on the Dania system, as an Italian pronunciation, but in the other case just inserts an IPA stress mark in the headword: Fontainebleau; seemingly, a certain familiarity with French pronunciation is expected from the users of this Danish 20-volume work.
Should prosodic features other than
stress be rendered?
Prosodic features other than stress, like accents 1 and 2 in Swedish and their equivalents in Norwegian, or the Danish stød, are seldom rendered in the notations of general reference books. The reason for this may be twofold: the phenomena mentioned are of minor importance for understanding an utterance, and their realization and geographical spread vary widely. However, for entries in Standard Chinese, it would be quite feasible for a vast encyclopedia to supply the four tones, as does for example Duden Aussprachewörterbuch in its later editions.
Optimizing notations
Due to lack of space, a single-volume sports dictionary cannot go into detail about pronunciation in the way a full-fledged encyclopedia can. Neither is it likely that its users are willing to work through a complicated system in order to explore the minute details of a word's pronunciation.
How narrow should a transcription be?
An encyclopedia, in contrast to a language dictionary, might contain words from a great number of languages; for practical reasons, a common notation system must be used. Ideally, this should be capable of conveying a phonemic rendering of all the languages. This creates a dilemma: if the transcription system is made too narrow in order to fulfil the needs of one language, in others it will necessitate irrelevant choices between allophones.
Range of individual phonetic symbols
A compromise solution to the problem mentioned would be to widen the range assigned to each phonetic symbol and use the signs somewhat differently for transcribing different languages. This requires some well-chosen examples in the introductory chapter, but it should be a viable way of obtaining reasonably good transcriptions of many of the original languages.
An alternative way would be to show respelled pronunciations, in analogy with those
often found in U.S. reference works, even ex-
tensive ones. However, as mentioned above,
the linguistic situation in Sweden is quite dif-
ferent from that in the United States, where
strongly anglicized pronunciations of almost all
foreign words are widely accepted.
Respelled pronunciations
Interesting examples of respelled pronunciations are found in Olausson and Sangster (2006) and its predecessor, the BBC Pronouncing dictionary of British names (1983). Here, the respelling systems are more condensed than their U.S. counterparts and are presented together
with IPA transcriptions. This allows for conve-
nient use by a wide range of people. The re-
spelled pronunciations convey a rather angli-
cized version; the IPA transcriptions, aimed pri-
marily at users familiar with foreign languages,
are more true to the languages of origin, though
still somewhat anglicized.
Audio pronunciations
Encyclopedias that are web-based or published on CD or DVD often offer users audio pronunciations (sound recordings of read entry headwords) as a complement to phonetic transcriptions. The production of such recordings brings some of the above issues to a head.
What languages to record?
It often proves practically impossible to make recordings in all the languages figuring among the headwords of an extensive encyclopedia. Either a limited number of languages can be chosen for recording by native speakers, or, if adapted (e.g. swedicized) pronunciation is used, a large number of languages, though rarely all, can be handled by one or two speakers.
How to choose speakers?
Whether native or domestic speakers are to be used, selecting them is a delicate task. Besides linguistic skill and a suitable voice, speaking style and age have to be taken into account. Even though a certain variation is desirable, the speakers must not be too disparate.
Coaching of speakers
When recording in a studio, speakers reading lists of words tend to use a prosody that reveals that they are (exactly!) reading a list, disregarding the fact that users will listen to each word as an isolated one. Coaching by a trained phonetician is advisable.
Conclusion
One of the main concerns of the editorial staff of an encyclopedia or other reference book is putting itself in the place of the notional users. This applies not least to pronunciation editors and others responsible for displaying pronunciation. Hopefully, this survey, once completed, will make for useful and easily accessible pronunciation data for all those who go looking for it.
References
BBC Pronouncing dictionary of British names,
2nd edn. (1983), Pointon, G.E. (ed.). Ox-
ford: Oxford Univ. Press
Catford J.C. (1988) A practical introduction to
phonetics. Oxford: Oxford Univ. Press
Duden Aussprachewörterbuch, 6th edn., re-
vised and updated (2005). Mannheim: Du-
denverlag
Garlén C. (2003) Svenska språknämndens uttalsordbok. Stockholm: Svenska språknämnden, Norstedts ordbok
International Phonetic Association (1999)
Handbook of the International Phonetic As-
sociation: guide to the use of the interna-
tional phonetic alphabet. Cambridge, U.K.:
Cambridge Univ. Press
Ladefoged P. and Maddieson I. (1996) The
sounds of the world's languages. Oxford:
Blackwell
Lindblad P. (1980) Svenskans sje- och tje-ljud i ett allmänfonetiskt perspektiv. Ph.D. thesis.
Lund: C.W.K. Gleerup
Molbæk Hansen, P. (1990) Udtaleordbog. Gyldendals røde ordbøger: Dansk udtale. Copenhagen: Gyldendalske Boghandel and Nordisk Forlag A/S
Olausson, L. and Sangster, C. (2006) Oxford BBC Guide to pronunciation. Oxford: Oxford Univ. Press
Pullum G.K. and Ladusaw W.A. (1986 [2nd
edn. 1996]) The phonetic symbol guide.
Chicago: The Univ. of Chicago Press
Warnant L. (1994) La prononciation française
dans sa norme actuelle. Paris and Gem-
bloux, Belgium: Duculot
Wells J.C. (2008) Longman pronunciation dic-
tionary, 3rd edn. Harlow, U.K.: Pearson Ed-
ucation Ltd.
EXPROS: Tools for exploratory experimentation with
prosody
Joakim Gustafson and Jens Edlund
Centre for Speech Technology, KTH Stockholm, Sweden
Abstract
This demo paper presents EXPROS, a toolkit for
experimentation with prosody in diphone
voices. Although prosodic features play an im-
portant role in human-human spoken dialogue,
they are largely unexploited in current spoken
dialogue systems. The toolkit contains tools for
a number of purposes: for example extraction
of prosodic features such as pitch, intensity and
duration for transplantation onto synthetic ut-
terances and creation of purpose-built custom-
ized MBROLA mini-voices.
Introduction
This demo paper presents EXPROS, a graphical
toolkit permitting us to experiment with pro-
sodic variation in diphone synthesis in an effi-
cient manner.
Prosodic features such as pitch, intensity
and duration play an important role for many of
the aspects of spoken dialogue that are central
to human-human dialogue. Still, to date they
are rarely exploited in human-computer dia-
logues. Examples of areas that would benefit
from the inclusion of more prosodic knowledge
include interaction control, the management of
turn-taking, interruptions, and backchannels;
attitude towards what is said, such as the signal-
ling of uncertainty or certainty; prominence,
such as contrastive focus and stress; and
grounding, as in brief feedback utterances for
verification and clarification.
On the perception side, there is a fair body
of research into these matters from the spoken
dialogue system point of view. Some of these
results have been taken as far as to implementa-
tion and experimentation in full-blown spoken
dialogue systems. On the production side, there
are fewer examples where our knowledge of
prosody has made it all the way to full-blown
systems. In current spoken dialogue systems,
pre-recorded prompts or unit selection synthesis
are often chosen because of their superior voice
quality. The drawback is that these techniques
make it difficult to vary prosody and to control
this variation in any detail, so few examples of
experimentation with such variations exist. One
of the few examples is Raux & Black (2003),
which also provides an overview of the topic.
There is a large body of studies of prosodic fea-
tures using re-synthesis with modified prosody
(using e.g. Praat) and with HMM synthesis, but
the results have proven difficult to implement
in real on-line systems.
Other synthesis methods – formant synthesis and diphone synthesis – provide greater con-
trol over prosodic features. The relatively low
voice quality of formant synthesis makes it un-
suitable for many user studies, however, and
diphone synthesis suffers from the relatively
large cost of recording the required diphones, as
well as from less-than-perfect voice quality.
Before going into the functionality currently
built into the toolkit, let's discuss a few of its
applications. Our main reason to experiment
with prosodic variation is to make spoken di-
alogue systems that more closely mimic hu-
man-human dialogue, in order to better exploit
its strengths. This need not be the case for all
spoken dialogue system design, but it is our
motivation here. The following are three exam-
ples of increasing complexity of dialogue needs
that EXPROS aims to meet.
Interaction control
A key area where humans excel over current
spoken dialogue systems is interaction control,
the management of the flow of the dialogue, for
example turn-taking and interruptions. An oft-
mentioned problem is that of user barge-ins, but
we would also want our systems to be able to
deal with system barge-ins and self-interrupts in
a better manner. The dialogue excerpts in Ex-
ample 1 exemplify this. In order for a spoken
dialogue system to produce the behaviours
listed above, the system's processing in its en-
tirety needs to be incremental, as noted in Allen
et al. (2001) and Aist et al. (2006). Here, how-
ever, we are only concerned with being able to
control the rendering of the speech sounds suf-
ficiently to produce utterances like the ones in
Example 1.

Example 1: Three dialogue excerpts

U What's the weather like in Stockton?
S The weather in Stockholm? Wait a mo* [*ment, I'll look it up]
U No, I said Stockton

U Any news on fashion /SIL/ in Tibet?
S OK, le* [*t me see what I can do]
S Ah, let me see what I can do

U Any news about Camden market?
S Let me see... no, there's no* [*thing new at the moment]
/fresh news arrive/
S Oh, wait, there's a fire in Camden!
The sounds of dialogue
In order to achieve this kind of dialogue, we
need to be able to test variations in perception
tests as well as in real human-computer dia-
logue situations. To do this, we need to be able
to record the required prompts with different
prosody, at the very least. In many cases, we
may want to record new diphones – in the example above, for example, we could record
P*_SIL diphones, that go from a phoneme P to
silence SIL abruptly, to make the interruptions
sound more realistic. Recording extra sets of
diphones for hypo- and hyper-articulated
speech may also be useful, as well as affective
speech, for example angry or despondent. Test-
ing out new voices can be very time consuming,
however, as a Swedish diphone voice typically
contains some 5000 diphones. This is far too
expensive for exploratory studies into the ef-
fects of prosodic and voice quality variations.
Instead we can create mini-voices – voices with
few diphones, that are able to produce only a
limited number of utterances, but that are easy
to record and to modify.
Incremental development
Finally, pre-recorded prompts, unit selection
synthesis, and diphone synthesis all suffer from
the need to enrol the original speaker each time
the voice is to be extended or changed. A di-
phone voice production is furthermore often
created in one go, and rarely updated or
changed after its completion. We attempt to
make it possible for speakers who are not the
original speaker to do as many extensions as possible – particularly to record new prosodic patterns – and also for the voice creation to be
done incrementally, by making it simple to add
new diphones and diphone sets when they are
needed.
Prompts and voices developed in EXPROS
can be used in perception tests, either of stand-
alone prompts or of re-synthesised dialogue ut-
terances, but most importantly they are intended
for use in interactive experiments, where the
pragmatics – the actual effect prosodic variation has on the interaction – can be measured.
The EXPROS Toolkit
The toolkit uses the Snack sound toolkit (1) as its backbone, and integrates functions from a number of existing tools, such as the Mbrola engine and database builder (2), a PC-KIMMO (3) morphological dictionary, NALIGN forced alignment (Sjölander & Heldner, 2004), /nailon/ prosodic extraction and normalisation (Edlund & Heldner, 2006), etc.

Text processing: Reading and management of
(prosodic) labels in the orthographic input.
These labels could be used to generate prosodic
patterns automatically, such as increased stress
or prolonged syllables.

Grapheme to phoneme conversion: The tool-
kit currently incorporates automatic transcrip-
tion using PC-KIMMO and a Swedish diction-
ary with transcribed morphs, an NALIGN
CART tree built on Centlex, a Swedish pronun-
ciation dictionary developed at the Centre of
Speech Technology, as well as a set of coarticulation rules (over word boundaries) built into NALIGN. In addition, user lexica can be defined and used.

(1) http://www.speech.kth.se/snack/
(2) http://tcts.fpms.ac.be/synthesis/mbrola.html
(3) http://www.sil.org/pckimmo/

Automatic speech alignment: The toolkit uses
the forced aligner NALIGN to extract phone
start and end times from recordings.

Automatic prosody parameter extraction:
For prosodic analysis, the toolkit can currently
use the methods built into the Snack sound
toolkit (ESPS get_f0 and AMDF pitch extrac-
tion as well as power analysis, which can be
used to estimate spectral tilt). The normaliza-
tion methods built into /nailon/ are also avail-
able.
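By way of illustration only – the toolkit itself relies on Snack's ESPS and AMDF implementations – a bare-bones AMDF pitch estimate for a single frame can be sketched in Python/NumPy as follows; the default search range is our own assumption:

import numpy as np

def amdf_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    # Average magnitude difference function: for each candidate lag,
    # compare the frame with a copy of itself shifted by that lag;
    # the lag with the smallest mean difference is taken as the period.
    lag_min = max(1, int(fs / f0_max))
    lag_max = min(len(frame) - 1, int(fs / f0_min))
    amdf = [np.mean(np.abs(frame[lag:] - frame[:-lag]))
            for lag in range(lag_min, lag_max + 1)]
    best_lag = lag_min + int(np.argmin(amdf))
    return fs / best_lag   # estimated F0 in Hz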

Modification of prosodic parameters: The
toolkit provides a number of methods for modi-
fication of prosodic parameter curves as well as
creation of new curves. These include direct
manipulation in a GUI, stylisation, normalisa-
tion and transformation to another speaker's
speaking style, model generated prosodic
curves, and transplantation of curves from re-
cordings.

Diphone synthesis: The toolkit uses an ex-
tended MBROLA synthesis engine (Drioli et
al., 2005) which adds control of for example
gain, spectral tilt, shimmer and jitter to render
audio. Using a combination of the components
listed above, the toolkit also gives the possibil-
ity to automatically generate the data needed to
build new MBROLA diphone databases, and
some scripts to make on-the-fly modifications
to how the MBROLA engine selects diphones.
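As a sketch of what the generated control data look like, the snippet below writes an MBROLA .pho file from phone labels, durations and pitch targets; the helper name and the example values are invented, but the line format (phone, duration in ms, followed by percent-position/F0 pairs) is the standard MBROLA input convention:

def write_pho(path, phones):
    # phones: list of (label, duration_ms, [(percent, f0_hz), ...]) tuples,
    # e.g. durations from forced alignment of a human recording and pitch
    # points sampled from its extracted F0 contour (transplantation).
    with open(path, "w") as f:
        for label, dur, targets in phones:
            pitch = " ".join(f"{p:d} {f0:.0f}" for p, f0 in targets)
            f.write((f"{label} {dur:d} {pitch}").rstrip() + "\n")

# invented example: a falling contour transplanted onto two phones and a pause
write_pho("ja.pho", [("j", 60, [(50, 140)]),
                     ("a:", 220, [(10, 150), (90, 110)]),
                     ("_", 200, [])])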
Next steps
A number of experiments and investigations
using EXPROS are underway:
We will test the effects of transplanted
prosody on perceived synthesis quality. Pre-
liminary listening tests suggest that transplant-
ing durations, intensity and pitch from human
recordings onto the diphone synthesis makes
diphone voices sound considerably better as a
whole, which is promising. We also want to test
this in the context of the findings of Hjalmars-
son & Edlund (in press), where synthesised ut-
terances containing typical features of human-
human dialogue, such as filled pauses and repe-
titions, were investigated.
The EXPROS tool has recently been used to
improve the subjective ratings of a bad speaker,
by re-synthesising 30 seconds of speech with
increased pitch variation and speaking rate
(Strangert & Gustafson, submitted). We intend
to do more experiments with resynthesis in or-
der to explore the limits of what can be ex-
pressed by manipulating prosody alone.
Furthermore, the toolkit has proven valuable
for verifying the quality of automatic prosodic analysis – pitch and intensity extraction as well as phone durations – by listening to the original recording and its resynthesis in parallel, a method inspired by Malfrère & Dutoit (1997).
Finally, we are in the process of running
tests where subjects use EXPROS to create new
versions of very brief feedback or clarification
utterances in order to change their meaning. We
have previously shown that monosyllabic words
can be understood as positive or negative
grounding on the perceptual or understanding
levels by manipulating their pitch contour
(Edlund et al., 2005), and using EXPROS, we
hope to be able to show the same for multisyl-
labic compound words.
Acknowledgements
Thanks to everyone who has put hard work into developing the publicly available tools that
are used in this toolkit. Special thanks to
Thierry Dutoit (MBROLA) and Kåre Sjölander
(Snack/NALIGN). This work was supported by
the Swedish research council project #2006-
2172 (Vad gör tal till samtal / What makes speech special) and MonAMI, an Integrated Project under the EC's Sixth Framework
Program (IP-035147).
References
Aist, G., Allen, J. F., Campana, E., Galescu, L.,
Gómez Gallo, C. A., Stoness, S. C., Swift,
M., & Tanenhaus, M. (2006). Software Ar-
chitectures for Incremental Understanding
of Human Speech. In Proceedings of Inters-
peech (pp. 1922-1925). Pittsburgh PA,
USA.
Allen, J. F., Ferguson, G., & Stent, A. (2001).
An architecture for more realistic conversa-
tional systems. In Proceedings of the 6th in-
ternational conference on Intelligent user
interfaces (pp. 1-8).
Drioli, C., Tesser, F., Tisato, G., & Cosi, P.
(2005). Control of voice quality for emo-
tional speech synthesis. In Proceedings of
AISV 2004 (pp. 789-798). Padova, Italy.
Edlund, J., & Heldner, M. (2006). /nailon/ -
software for online analysis of prosody. In
Proc of Interspeech 2006 ICSLP. Pittsburgh
PA, USA.
Edlund, J., House, D., & Skantze, G. (2005).
The effects of prosodic features on the in-
terpretation of clarification ellipses. In Pro-
ceedings of Interspeech 2005 (pp. 2389-
2392). Lisbon, Portugal.
Hjalmarsson, A., & Edlund, J. (in press). Hu-
man-likeness in utterance generation: ef-
fects of variability. To be published in Pro-
ceedings of the 4th IEEE Workshop on Per-
ception and Interactive Technologies for
Speech-Based Systems. Kloster Irsee, Ger-
many.
Malfrère, F., & Dutoit, T. (1997). Speech Synthesis for Text-to-Speech Alignment and Prosodic Feature Extraction. In Proceedings of the International Symposium on Circuits and Systems (pp. 2637-2640).
Raux, A., & Black, A. (2003). A Unit Selection
Approach to F0 Modeling and its Applica-
tion to Emphasis. In Proceedings of ASRU
2003, St Thomas, US Virgin Islands.
Sjlander, K., & Heldner, M. (2004). Word
level precision of the NALIGN automatic
segmentation algorithm. In Proc of The
XVIIth Swedish Phonetics Conference, Fo-
netik 2004 (pp. 116-119). Stockholm Uni-
versity.
Strangert, E., & Gustafson, J. (submitted). Sub-
ject ratings, acoustic measurements and syn-
thesis of good-speaker characteristics. Sub-
mitted to Proceedings of Interspeech 2008.
Brisbane, Australia.
Presenting in English or Swedish:
Differences in speaking rate
Rebecca Hincks
Department of Speech, Music and Hearing
KTH

Abstract
This paper attempts to quantify differences in
speaking rates in first and second languages, in
the context of the growth of English as a lingua
franca, where more L2 speakers than ever be-
fore are using English to perform tasks in their
working environments. One such task is the
oral presentation. The subjects in this study
were fourteen fluent English second language
speakers who held the same oral presentation
twice, once in English and once in their native
Swedish. The temporal variables of phrase
length (mean length of runs in syllables) and
speaking rate in syllables per second were cal-
culated for each language. Speaking rate was
found to be 23% slower when using the second
language, and phrase length was found to be
24% shorter.
Introduction
As English continues its growth as a lingua
franca, more and more speakers across the
world find themselves in front of an audience
that needs to hear the speaker's message in a
language that neither speaker nor listener is en-
tirely comfortable with. One reason for the dis-
comfort can be traced to the extra time it takes
to formulate one's message in a second lan-
guage (L2). Slower English speakers in busi-
ness meetings can have difficulty taking the
floor from native speakers (Rogerson-Revell,
2007) and international students may be frus-
trated by their inability to formulate responses quickly enough to contribute to classroom discussion (Jones, 1999). Though researchers have
begun to explore the effect of L2 language use
in interactive situations such as the meeting or
the seminar, the ramifications of slower L2
speaking rate when holding an instructional
monologue, such as a presentation or a lecture,
have not been explored.
Conveying information to an audience in an
L2 can be a difficult experience for many rea-
sons. Teachers complain that they are less able
to be spontaneous, but they may not themselves understand that they require a bit more time to
produce the same linguistic content. In general,
little is known about how the use of a second
language affects vital issues such as the
speaker's ability to engage the audience and to
adequately cover the intended content in the
time allotted for the presentation or lecture.
Temporal features – particularly speaking rate – can have an influence on both abilities.
Temporal variables have previously been
explored from the L1 perspective, the L2 per-
spective, and various interfaces between them.
The work that has been done has been carried
out in an attempt to study the cognitive proc-
esses underlying linguistic production (Gold-
man-Eisler, 1968), to understand language ty-
pology (Grosjean & Deschamps, 1973), to sup-
port a theoretical model for the process of sec-
ond language acquisition (Towell, Hawkins, &
Bazergui, 1996) or for tools in language as-
sessment (Rekart & Dunkel, 1992). The pre-
sent study is motivated by other needs that
could be described as pragmatic rather than
theoretical. We are now in a situation, at least
in Europe, where more speakers than ever be-
fore are carrying out their daily business in a
second language, English. The fact that speak-
ers speak more slowly in a second language
may be obvious but it is not trivial in the glob-
alizing world. The question asked here is sim-
ply how much speakers can be slowed down by
working in a second language.
This research builds on earlier work
(Hincks, 2005a and 2005b) which examined a
smaller database of five speakers making dual
lingual presentations. Those five speakers form
part of this study as well, but their recordings
have been augmented with nine new speakers
to create a more reliable subject group. The
first study looked at two primary variables:
speaking rate and pitch variation. The present
study omits pitch variation, saving that aspect
for a future study. The 2005 results showed
large differences in speaking rate, which have
been confirmed by testing on a larger group.
Method
Working at the syllable rather than word level
is a necessity for any kind of cross-linguistic
study; although Swedish and English are
closely related languages, they use different
orthographic conventions. An assumption is
made that the information content of syllables
is equivalent when comparing genetically re-
lated languages such as English and Swedish.
This study uses the second rather than the minute as the unit of time, and speaking rate (SR) is
thus expressed in syllables per second (sps).
Another variable that has been found to be
relevant in the study of speaking rate is what is
known as the mean length of runs (MLR) – what could also be called phrase length, or the
amount of speech, in syllables, between pauses.
The MLR will generally be shorter in L2
speech than in L1 speech (Kormos and Dénes,
2004), and in that way give an indication of the
frequency of pauses in the speech. Different
pause lengths have been used to define the
boundaries of the phrases, but most studies
have used a length between 200 and 300 milli-
seconds. This study uses a length of 250ms, or
one quarter of a second.
The fourteen subjects for this study, six
women and eight men, were all native Swedish students of engineering at KTH, taking an elec-
tive course in Technical English. They had
taken a written diagnostic test upon application
to the language department, and had been
placed in either the Upper Intermediate (B2+)
(10 subjects) or Advanced classes (C1) (4 sub-
jects). The English oral presentations were re-
corded in the second half of the 56-hour
courses, so that students had had plenty of time
to warm up any rusty spoken English. The
Swedish oral presentations were made outside
of class, using the same visual material and be-
fore a smaller audience.
The 28 presentations were carefully tran-
scribed in a three-step process. First the entire
presentation was orthographically transcribed,
including filled pauses. Speech recognition was
a helpful tool in the English transcriptions. The
speaker-dependent dictation software Dragon
NatSpeak 9 was trained to the researcher's voice, and the researcher then repeated the presentations into the dictation program. A complete, though
somewhat inaccurate, transcription could be
produced in real time – 10 minutes for a 10-
minute presentation. Listening to the presenta-
tion two or three more times allowed for cor-
rection of the inaccuracies and addition of the
filled pauses that the speech recognition is
trained to ignore. The vocabulary of the dicta-
tion software was impressive, including Swed-
ish place names and rare words such as types of
pharmaceuticals and phenomena (e.g. quantum
teleportation). Swedish personal names were,
however, more problematic.
The second phase of transcription, which al-
lowed further correction of any remaining inaccuracies, was to break the transcriptions into
phrases, using pauses as boundaries. The
speech waveform was used to locate all silent
or filled pauses longer than 250 milliseconds.
Finally, in the third phase of transcription,
each phrase was broken into syllables by insert-
ing spaces to represent syllable boundaries.
Filled pause markings were first removed so
that they would not be counted as syllables.
The total number of syllables was divided by
the length of the presentation in seconds to find
the speaking rate.
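To make the two measures concrete, a minimal sketch in Python (with invented numbers; the 250 ms pause criterion is the one described above):

def temporal_measures(run_syllables, total_time_s):
    # run_syllables: syllable count of each inter-pause run (pauses >= 250 ms)
    # total_time_s: duration of the presentation in seconds, pauses included
    total_syllables = sum(run_syllables)
    mlr = total_syllables / len(run_syllables)   # mean length of runs (syllables)
    sr = total_syllables / total_time_s          # speaking rate (syllables/second)
    return mlr, sr

mlr, sr = temporal_measures([14, 9, 12, 11], 15.3)   # made-up example values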
Results
Table 1 presents the mean length of runs, the
total number of syllables, the total speaking
time, and speaking rate.
Phrase length (MLR)
All speakers used shorter phrase lengths in
English than in Swedish. The means were
12.59 syllables per phrase in L1 and 9.51 sylla-
bles per phrase in L2, a mean difference of
3.08, SD 2.15. This shorter length in L2 is sta-
tistically significant: t (13) =3.10, p <.01,
one-tailed. The phrase lengths by speaker cor-
relate strongly between languages: R=0.82.
Speaking rate
The mean SR for L1 was 3.89 sps (SD .61), and
in L2 3.12 sps (SD .46). The slower speaking
rate in L2 is statistically significant: t (13) =
3.438, p <.01, two tailed. This can also be ex-
pressed as a mean difference of 20.8%, where
L2 is 23% slower than L1, and L1 is 18.7%
faster than L2. As expected, all speakers spoke
more quickly in L1: at least 3 sps, with three
speakers approaching a speaking rate of 5 sps.
In L2 the rates range from a low of 2.37 sps to
a high of 4.12 sps. The SRs between languages
correlate strongly, R=0.85. They also correlate
by speaker with phrase length: 0.82 for L1, and
0.89 for L2.
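As a sketch of how these comparisons are computed, the paired test and the between-language correlation can be expressed as follows (shown, purely for illustration, on the speaking rates of the first four speakers in Table 1; the reported statistics use all fourteen):

import numpy as np
from scipy.stats import ttest_rel, pearsonr

sr_l1 = np.array([3.82, 3.15, 3.31, 3.53])    # Swedish L1 rates, speakers S1-S4
sr_l2 = np.array([2.72, 2.37, 2.77, 2.76])    # English L2 rates, same speakers

t_stat, p_value = ttest_rel(sr_l1, sr_l2)     # paired t-test across speakers
r_value, _ = pearsonr(sr_l1, sr_l2)           # between-language correlation
slowdown_pct = 100 * (sr_l1 - sr_l2) / sr_l1  # per-speaker slow-down in percent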
Table 1. The mean length of runs between pauses of >250 ms, the total number of syllables in the
presentation, the total seconds of speech, and the speaking rate in syllables per second.

Speaker | Mean length of runs (syllables) | Total syllables | Total time (seconds) | Speaking rate (syll/s)
(for each measure, the first value is Swedish L1 and the second English L2)
S1(M) 11.57 6.40 2244 2272 588 836 3.82 2.72
S2(M) 10.98 6.94 1383 1534 439 648 3.15 2.37
S3(M) 8.94 7.09 1225 1318 370 475 3.31 2.77
S4(F) 10.55 8.00 1889 1568 535 568 3.53 2.76
S5(M) 11.49 8.23 2367 1844 545 576 4.34 3.20
S6(F) 8.84 8.45 1538 1369 477 472 3.22 2.90
S7(M) 9.27 9.14 1660 1773 506 597 3.28 2.97
S8(M) 11.14 9.23 1437 1938 355 580 4.05 3.34
S9(M) 11.37 9.87 2411 1797 602 521 4.00 3.45
S10(M) 14.79 10.36 3934 2487 840 700 4.68 3.55
S11(F) 14.26 10.99 2737 2571 676 752 4.05 3.42
S12(F) 12.96 11.16 1361 1296 411 425 3.31 3.05
S13(F) 20.36 12 3502 2519 722 696 4.85 3.62
S14(F) 19.73 15.34 2229 2025 464 491 4.80 4.12
SD 3.62 2.38 820 446 138 119 .61 .46
Mean 12.59 9.51 2137 1879 538 596 3.89 3.12

Discussion
Both similarities and differences between the
L1 and L2 presentations have been revealed by
this examination of two temporal variables: the
amount of speech uttered between pauses
(MLR), and the speaking rate (SR), including
pauses, over 6-14 minutes. To begin with the
similarities, there is a strong effect of individ-
ual speaking style between the two languages.
The correlations between L1 and L2 of 0.82 (MLR) and 0.85 (SR) show that those speakers
who used shorter phrase lengths and slower
rates of speech in one language used them in
the other language as well, confirming previous
work done on laboratory speech (Deschamps,
1980; Raupach, 1980; Towell, Hawkins, &
Bazergui, 1996). Though other researchers
have suggested using phrase length to measure
fluency in second languages, it is important to
recognize that phrase length differs in ones
first language as well.


The main research issue addressed here was
an attempt to quantify the effect on speaking
rate of using an L2 in the oral presentation
situation. Using English instead of their native
language meant that all speakers had shorter
phrase lengths and slower rates of speech. On
average, using English slowed the speakers
down by 23%. The difference can be attributed
to the frequent short pausesas evidenced by
the shorter phrase lengthsthat are necessary
for the speakers to find the formulations they
need in L2. A long phrase length shows that
that linguistic knowledge has been procedural-
ized (Levelt, 1989; Towell, Hawkins, & Bazer-
gui, 1996). The subjects in this study, though
they were speaking about material they them-
selves had prepared and were fluent speakers of
English, show the degree to which operating in
a second language affects the cognitive proc-
esses underlying speech production.
Conclusion
Recommendations
The slower speaking rates shown in this study
do not necessarily pose a problem when the
speech in question is instructional speech.
When both speakers and listeners are operating
in a second language, a speaking rate of about 3
sps is probably appropriate. However, it is im-
portant for individual speakers and for policy-
makers to understand and acknowledge the ef-
fect of using a second language on speaking
rate, particularly when making a shift from do-
ing a task one normally does in L1 to doing it
in L2. If the rate of delivery of a 45-minute
lecture is slowed down by 25%, then the lecture
will take closer to an hour to finish. Course
plans and schedules need to be adapted to ac-
commodate this, especially in light of the fact
that research has shown that students tend to
save their questions for after class when they
are themselves operating in an L2 (Airey &
Linder, 2006). Other measures that could be
considered would include variable speaker time
at conferences and other gatherings.
Further work
The next question to be asked in the study of
the dual-language presentation database is to
what extent using different languages affected
the content of the presentations. Is the slower
speaking rate a symptom of such linguistic dif-
ficulty that speakers omit information in L2
that they include in L1? It is beyond the scope
of the present paper to investigate this question
in detail, but it can be said that an initial study
comparing the propositional content of the
fourteen pairs of presentations finds a slight but
not overwhelming advantage for the L1, espe-
cially when the presentations are normalized
for time. Further differences appear in the
meta-discourse with which the speakers struc-
ture their presentations, and the extent to which
they elaborate on the content. These issues will
be the subject of forthcoming work.
References
Airey, J. & Linder, C. (2006). Language and the
experience of learning university physics in
Sweden. European Journal of Physics 27,
553-560.
Hincks, R. (2005a). Presenting in English and
Swedish. Proceedings of Fonetik 2005
(Gothenburg University Department of Lin-
guistics) 45-48.
Hincks, R. (2005b). Computer Support for
Learners of Spoken English. Doctoral The-
sis. Royal Institute of Technology Stock-
holm.
Deschamps, A. (1980). The syntactical distribu-
tion of pauses in English spoken as a second
language by French students. In Temporal
Variables in Speech (pp. 255-262): Mouton.
Goldman-Eisler, F. (1968). Psycholinguistics.
Experiments in Spontaneous Speech. Lon-
don: Academic Press.
Grosjean, F., & Deschamps, A. (1973). Analyse des variables temporelles du français spontané II. Comparaison du français oral dans la description avec l'anglais (description) et avec le français (interview radiophonique). Phonetica 28, 191-226.
Jones, J. (1999). From Silence to Talk: Cross-
Cultural Ideas on Students' Participation in
Academic Group Discussion. English for
Specific Purposes, 18(3), 243-259.
Kormos, J., & Dénes, M. (2004). Exploring
measures and perceptions of fluency in the
speech of second language learners. System,
32, 145-164.
Levelt, W. (1989). Speaking: from Intention to
Articulation. Cambridge: MIT Press.
Raupach, M. (1980). Temporal variables in first
and second language speech production. In
Temporal Variables in Speech (pp. 263-
270): Mouton.
Rekart, D., & Dunkel, P. (1992). The Utility of
Objective (Computer) Measures of the Flu-
ency of English as a Second Language. Ap-
plied Language Learning, 3, 65-85.
Rogerson-Revell, P. (2007). Using English for
International Business: A European case
study. English for Specific Purposes, 26,
103-120.
Towell, R., Hawkins, R., & Bazergui, N.
(1996). The Development of Fluency in
Advanced Learners of French. Applied Lin-
guistics, 17(1), 84-119.
Preaspiration and Perceived Vowel Duration in Norwegian
Jacques Koreman (1), William J. Barry (2) and Marte Kristine Lindseth (1)
(1) Department of Language and Communication Studies, NTNU, Trondheim
(2) Institute of Phonetics, Saarland University, Saarbrücken
Abstract
This article presents an experiment to investi-
gate the perceived duration of Norwegian vow-
els before [d] versus preaspirated [t]. It is
shown that preaspiration contributes to the
perceived duration of the vowel before [t]. The
general implications of this finding for phone
segmentation and for phonetic research using
vowel duration measurements are discussed.
Introduction
In a production study of American English,
Peterson and Lehiste (1960) found that vowel
duration including aspiration after phonologi-
cally voiceless or fortis plosives (308 ms) is
longer than after phonologically voiced or lenis
word-initial plosives (274 ms); if aspiration is
excluded, it is shorter (251 ms) on average in a
set of 68 minimal pairs. In a perception study
for German (results as yet unpublished), a cor-
responding effect on perceived duration was
shown: the vowel duration after a lenis plosive
was judged equal to the duration of the vowel
plus half of the aspiration after a fortis plosive.
Comparable to the production data above,
the vowel incl. preaspiration before a tense plo-
sive is longer than the vowel before a lax plo-
sive in preaspirating dialects of Norwegian
(Van Dommelen, 1999), while the vowel ex-
cluding preaspiration is shorter; but there is
substantial variation across dialects. In percep-
tion, the perceptual effect of preaspiration on
phonological categorization has been investi-
gated by Moxness (1997), who found that the
perceived (phonological) vowel length is not
affected by the presence or absence of preaspi-
ration in V:C versus VC: stimuli. Van Dom-
melen (1998) showed an effect of preaspiration
on the perception of plosives as fortis versus
lenis in Norwegian.
The present study investigates the perceived
phonetic vowel duration in stimuli containing
preaspirated versus fully voiced vowels. We
hypothesize that, similar to aspiration, preaspi-
ration influences the perceived vowel duration.
More specifically, our goal is to evaluate how
much of the pre-aspiration is perceived as part
of the vowel.
Method
We shall first discuss the selection of the stimu-
lus pairs and then describe their preparation for
the perception experiment, followed by a de-
scription of the experiment itself.
Selection of the stimuli
Ten repetitions of two sets of /CV:C/ and /CVC:/ stimuli were recorded with all combinations of /i, a, u/ followed by /p,b; t,d; k,g/ (and
with the same initial C in each minimal pair
differing in [voice] for the second consonant).
The stimuli were presented for reading in ran-
dom order on a computer screen using a
PowerPoint presentation, and were recorded
directly onto hard-disk in a studio, with a sam-
pling frequency of 44 kHz and a 16-bit ampli-
tude resolution. Comparison of the stimuli
showed that preaspiration after short vowels is
generally longer than after long vowels. The
vowel /a/ showed no supraglottal friction,
which did sometimes occur with close vowels,
especially /i/. To maximize the presence of true
preaspiration, we selected batte-badde
from the list as the single stimulus pair for our
perception experiment. An additional consid-
eration was that these are both nonsense words,
so that the listeners are not affected by familiar-
ity or frequency of the stimuli.
Figure 1. Segmentation of vowel in batte into
modally voiced, breathy and preaspirated portions
To investigate how perceived vowel dura-
tion is affected by preaspiration, we used ten
repetitions of the stimuli spoken by a single,
male speaker from Stavanger. Figure 1 shows
an example of the segmentation of the vowel in
batte into modally voiced, breathy voiced and preaspirated portions, for which Praat was
used. Since there was typically a breathy
voiced signal portion in the transition from mo-
dal voice to preaspiration, this was included as
a separate factor in the perception experiment.
Preparation of the stimuli
Two native Norwegian listeners (both MA stu-
dents of Phonetics) could not distinguish the
signal portions of the two sets of stimulus
words from the release to the end of the word.
Thus, batte and badde are differentiated by
the stressed vowel and the following closure.
The vowel of the nonsense words of the
batte type all consisted of modal voicing, fol-
lowed by breathy voicing and preaspiration.
The vowel in badde consisted entirely of mo-
dal voicing which continued into the closure of
the following /d/. The stimuli for the listening
test were adapted so that the durations of the
vocalic portions were carefully controlled:
Figure 2. Schematic diagram of the vowel du-
rations in six stimulus conditions (time axes in
the real stimuli are not comparable across
stimulus conditions)
Stimulus 0: the duration of the vowel in badde equals that of the modally voiced + breathy voiced + preaspirated portions of the vowel in batte.
All stimulus pairs were selected such (from
the ten repetitions) that the two vowels in a pair
were similar. If this was not possible, the two
stimuli were changed by deleting or adding sin-
gle glottal periods until the vocalic portions of
interest had more or less the same durations.
The following stimuli have increasingly
longer vowels (from vowel onset until the clo-
sure of the following consonant) in batte than
in badde:
Stimulus 1: the duration of the vowel in
badde equals that of the modally voiced +
breathy voiced + half of the preaspirated por-
tion in batte.
Stimulus 2: the duration of the vowel in
badde equals that of the modally voiced +
breathy voiced portions in batte.
Stimulus 3: the duration of the vowel in
badde equals that of the modally voiced por-
tion in batte.
Notice that in the last three stimuli, the
batte stimulus word will be longer than
badde, namely by the other half of the
preaspirated portion (which is still present!) in
stimulus 1, the whole preaspirated portion in
stimulus 2, and the breathy voiced + preaspi-
rated portions in stimulus 3.
In addition to the above three conditions
there are two conditions in which the vowel in
batte is relatively shorter:
Stimulus -1: the duration of the vowel in
badde is longer than the total vowel duration
in batte by the duration of the breathy voiced
portion in batte (but the vowel is modally
voiced throughout).
Stimulus -2: the duration of the vowel in
badde is longer than the total vowel duration
in batte by the duration of the breathy voiced
+ half of the preaspirated portion in batte
(but the vowel is modally voiced throughout).
We did not include a condition stimulus -3
(same as stimulus -2, but with an even longer
vowel in badde) because the vowel in that
case was always perceived as longer than the
total vowel in batte in preliminary listening.
Inclusion would have increased the number of
stimuli, without giving additional information.
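The six conditions can be summarised as target durations for the badde vowel, given the measured modal, breathy and preaspirated portions of the batte vowel; the following sketch merely restates the scheme in Figure 2 (the portion durations are invented):

def badde_targets(modal, breathy, preasp):
    # Durations (ms) of the modally voiced, breathy voiced and preaspirated
    # portions of the batte vowel; each condition gives the target duration
    # of the (fully modal) badde vowel.
    return {
        -2: modal + breathy + preasp + breathy + 0.5 * preasp,
        -1: modal + breathy + preasp + breathy,
         0: modal + breathy + preasp,
         1: modal + breathy + 0.5 * preasp,
         2: modal + breathy,
         3: modal,
    }

targets = badde_targets(modal=120.0, breathy=30.0, preasp=60.0)  # made-up values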
Two stimulus sets were prepared: in set A, a
stimulus pair was selected for each condition
from the ten different realizations of batte
and badde which fulfilled its vowel duration
criteria (cf. Figure 2). The advantage of this
stimulus set is that the stimuli were produced
naturally, i.e. the (majority of the) stimuli did
not need to be manipulated (by dropping or
copying glottal periods) to obtain the vowel du-
rations according to the scheme in Figure 2.
The disadvantage of those stimuli is that in ad-
dition to the durational differences there may
be other factors which influence the perception
of vowel duration.
For this reason, we also created a set B, in
which only one stimulus pair was selected as a
basis for the perception experiment. The pair
was chosen so as to fulfill condition 0, i.e. the
vowel in badde had the same duration as the
total vowel duration in batte. To derive the
other conditions, the stimuli were manipulated
by inserting or deleting glottal periods from the
modally voiced portion of the vowel. The
batte and badde stimuli were manipulated
equally strongly (e.g. two glottal periods in-
serted into badde and two dropped from
batte for condition -1).
Perception experiment
Eight listeners listened to stimulus set A. The
same listeners, plus another four, listened to
stimulus set B. The listeners were all native
Norwegians between 25 and 60 years old, and
had no reported hearing problems.
The listeners' task was to judge which of the two stimuli in each pair was the longer, and to respond by checking the corresponding box on a
response form (on paper) to indicate their per-
ception. There was no equal duration choice,
since we wanted to prevent the listeners from
using this option too often. The listeners could
hear the stimulus pair over headphones as often
as they liked by moving the mouse over a loud-
speaker symbol in the PowerPoint presentation.
Mouse Over was used instead of Mouse
Click to prevent the disturbing sound of
mouse clicks.
The batte-badde pairs were offered in
both orders, balanced across ten lists (repeti-
tions). Within each list, the stimulus pairs were
offered 10 times in different pseudo-
randomized order. The pseudorandomization
consisted in ensuring that the conditions of two consecutive stimuli differed by at least 2, i.e. Stimulus 1 could not be followed by Stimulus 0 or 2. The total number of stimuli was 6 conditions × 2 orders × 10 repetitions = 120 stimuli. Each list
was preceded by five and followed by three
filler items.
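The pseudorandomization can be stated as a simple constraint on consecutive conditions; a possible sketch in Python (the constraint follows the description above, everything else is an implementation assumption):

import random

def pseudorandom_order(conditions, repetitions, min_gap=2):
    # Draw items one at a time, only accepting a condition that differs
    # from the previous one by at least min_gap; restart if we get stuck.
    pool_template = conditions * repetitions
    while True:
        pool, order = pool_template[:], []
        while pool:
            ok = [c for c in pool if not order or abs(c - order[-1]) >= min_gap]
            if not ok:
                break                      # dead end: start over
            choice = random.choice(ok)
            order.append(choice)
            pool.remove(choice)
        if not pool:
            return order

order = pseudorandom_order([-2, -1, 0, 1, 2, 3], repetitions=2)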
Results
The two versions of the experiment using
stimulus sets A and B led to substantially dif-
ferent results (see Figure 3). In general, the
upward trend in "longer vowel" responses from Stimulus -2 to Stimulus 3 corroborates our hy-
pothesis. But for set A, the trend is weak and
the number of responses where the vowel in
batte is considered longer than that in
badde never exceeds 50%. That is, in every condition the vowels in the badde stimuli are judged longer than those in batte in the majority of responses.
For set B, where the stimuli were all de-
rived from a single pair of stimuli, the trend is
much clearer, and shows a clear transition from
3% longer vowel responses for Stimulus -2 to
almost 94% for Stimulus 3.
Figure 3. Response percentage longer
vowel in batte for stimulus sets A and B
Differences between stimulus conditions
Separate analyses of variance for the two
stimulus sets showed that order of the stimuli
within the pair had no significant effect on the
perceived relative vowel duration in the two
stimuli, nor did order interact with stimulus
condition. The difference between the six
stimulus conditions, however, was highly sig-
nificant for both stimulus sets (set A:
F(0.001;5,84)=10.21; set B:
F(0.001;5,132)=146.83).
For set A, Scheffé's post-hoc tests resulted
in three homogeneous subgroups (-1,-2,1,0 <
0,3 <3,2), where the middle subgroup is almost
significant at 5%. The tendency is therefore
that the perceived vowel duration in stimulus
conditions 2 and 3 differs from the other condi-
tions. But the vowel in batte is mostly per-
ceived as shorter than that in badde, as noted
at the beginning of this section.
For stimulus set B, there is also a division
into three homogeneous subgroups (-2,-1,0 <
0,1 <2,3), but the effect is more consistent with
our hypothesis, with the percentage of longer
vowel in batte responses increasing from
stimulus condition -2 (3%) to stimulus condi-
tion 3 (94%). A sudden change in the response
is visible going from condition 1 to condition 2.
Differences between the listeners
Clear differences can be observed between the
listeners. In the responses to stimulus set A (see
Figure 4), listener PL for instance follows the
expected pattern with a steady increase in the
number of longer vowel in batte responses
from stimulus condition -2 to condition 3, and
this subject had more than 50% such responses
in condition 3. Listener SH on the other hand
does not seem to be influenced by the differ-
ences between the stimuli, with around 50%
longer vowel in batte responses for all stimu-
lus conditions.
Figure 4. Individual listener response percent-
ages (of longer vowel in batte) for set A
This shows that it was not only the stimuli
which created the differences, although on the
other hand all subjects behaved roughly the
same for stimulus set B (see Figure 5), which
was more strongly controlled in that all stimuli
were derived from a single stimulus pair.
Discussion
The two stimulus sets show very different re-
sults. Set B seems to be most reliable in the ob-
served tendencies, both across stimulus condi-
tions and across subjects. These results show
that the vowel in batte is judged longer than
that in badde when the durations of the modally voiced + breathy voiced + half of the preaspirated portion of the vowel together exceed the duration of the vowel in badde. In other words, the whole breathy voiced and half of the preaspirated portion of the vowel contribute to its perceived length. This corresponds to the previous observations concerning (post-)aspiration.
But these observations are not corroborated
by the results for set A. In this stimulus set, the
variation across stimulus conditions is much
smaller and the behaviour of individual listen-
ers shows more variation. This may indicate
that the listeners rely on different cues for
vowel duration, which may also differ across
the stimulus pairs – remember that the stimuli in the listening experiment were not all based on the same pair, as they were in set B. Also, there was a strong preference for longer vowel in badde responses in all conditions for set A. We were
not able to find a reason for this, despite close
inspection of the stimuli.
The results highlight a possible inconsis-
tency in segmentation conventions, since aspi-
ration (also used in Norwegian after fortis plo-
sives) is normally considered part of the pre-
ceding plosive which causes it, whereas
preaspiration is normally segmented as part of
the vowel instead of the following plosive. The
results for their effect on perceived vowel dura-
tion which are reported here and the unpub-
lished results for aspiration show, however, that
both affect the perceived duration of the vowel.
Using traditional segmentation criteria can
therefore lead to wrong conclusions if the seg-
mentation in phonetic studies on vowel dura-
tion is used to make inferences about the effect
of vowel duration in perception.
Of course, this study was limited in its ap-
proach. Other consonantal places of articulation
and the speaker's sex, which have been shown
to differ in production studies (e.g. Helgason
and Ringen, 2008), should be taken into con-
sideration in perceptual studies.
References
Helgason, P. and Ringen, C. (2008). Voicing
and aspiration in Swedish stops. Journal of
Phonetics, doi: 10.1016/j.wocn.2008.02.003
(to appear).
Moxness, B. (1997). Preaspirasjon in Trønder. MA thesis, NTNU.
Peterson, G.E. and Lehiste, I. (1960). Duration
of syllable nuclei in English. J. Acoust. Soc.
Am. 32 (6), 693-703.
Van Dommelen, W. (1998). Production and
perception of preaspiration in Norwegian.
In Proc. FONETIK 98, 20-23.
Van Dommelen, W. (1999). Preaspiration in
intervocalic /k/ vs. /g/ in Norwegian. Proc.
ICPhS, San Francisco, 2037-2040.
Figure 5. Individual listener response percentages (of longer vowel in batte) for set B
The fundamental frequency variation spectrum
Kornel Laskowski (1), Mattias Heldner (2) and Jens Edlund (2)
(1) interACT, Carnegie Mellon University, Pittsburgh PA, USA
(2) Centre for Speech Technology, KTH Stockholm, Sweden
Abstract
This paper describes a recently introduced vector-valued representation of fundamental frequency variation, the FFV spectrum, which has a number of desirable properties. In par-
ticular, it is instantaneous, continuous, distri-
buted, and well-suited to application of stan-
dard acoustic modeling techniques. We show
what the representation looks like, and how it
can be used to model prosodic sequences.
Introduction
While speech recognition systems have long
ago transitioned from formant localization to
spectral (vector-valued) formant representa-
tions, prosodic processing continues to rely
squarely on a pitch tracker's ability to identify a
peak, corresponding to the fundamental fre-
quency (F0) of the speaker. Peak localization in
acoustic signals is particularly prone to error,
and pitch trackers (cf. de Cheveigné & Kawa-
hara, 2002) and downstream speech processing
applications (Shriberg & Stolcke, 2004) employ
dynamic programming, non-linear filtering, and
linearization to improve robustness. These me-
thods introduce long-term dependencies which
violate the temporal locality of the F0 estimate,
whose measurement error may be better han-
dled by statistical modeling than by (linear)
rule-based schemes. Even if a robust, local, ana-
lytic, statistical estimate of absolute pitch were
available, applications require a representation
of pitch variation and go to considerable addi-
tional effort to identify a speaker-dependent
quantity for normalization (e.g. Edlund &
Heldner, 2005).
In the current work, we describe a recently
derived representation of fundamental frequen-
cy variation (see also Laskowski, Edlund, &
Heldner, 2008a, 2008b; Laskowski, Wölfel,
Heldner, & Edlund, in press), which implicitly
addresses most if not all of the above issues.
This spectral representation, which we will re-
fer to here as the fundamental frequency varia-
tion (FFV) spectrum, is (1) instantaneous, not
relying on adjacent frames; (2) continuous, de-
fined for all frames; (3) distributed; and (4) po-
tentially sparse, making it suitable for the appli-
cation of standard acoustic modeling techniques
including bottom-up, continuous statistical se-
quence learning.
In previous work, we have shown that this
representation is useful for modeling prosodic
sequences for prediction of speaker change in
the context of conversational spoken dialogue
systems (Laskowski et al., 2008a, 2008b); how-
ever, the representation is potentially useful for
any prosodic sequence modeling task.
The fundamental frequency varia-
tion spectrum
Instantaneous variation in pitch is normally
computed by determining a single scalar, the
fundamental frequency, at two temporally adja-
cent instants and forming their difference. F0
represents the frequency of the first harmonic in
a spectral representation of a frame of audio,
and is undefined for signals without harmonic
structure. In the context of speech processing
applications, we view the localization of the
first harmonic and the subsequent differencing
of two adjacent estimates as a case of subop-
timal feature compression and premature infe-
rence, since the goal of such applications is not
the accurate estimate of pitch. Instead, we want
to leverage the fact that all harmonics are
equally spaced in adjacent frames, and use
every element of a spectral representation to
yield a representation of the F0 delta.
To this end, we propose a vector-valued re-
presentation of pitch variation, inspired by va-
nishing-point perspective, a technique used in
architectural drawing and grounded in projec-
tive geometry. While the standard inner product
between two vectors can be viewed as the
summation of pair-wise products with pairs se-
lected by orthonormal projection onto a point at
infinity, the proposed vanishing-point product induces a 1-point perspective projection onto a point at a finite distance (Figure 1). When applied to two vectors representing a signal's spectral content, F_L and F_R, at two temporally adjacent instants, the vanishing-point product yields the standard dot product between F_L and a dilated version of F_R, or between F_R and a dilated version of F_L, for positive and negative values of the dilation parameter, respectively.


Figure 1. The standard dot-product shown as an orthonormal projection onto a point at infinity (left panel),
and the proposed vanishing-point product, which generalizes the former (right panel).

The degree of dilation is controlled by the magnitude of the dilation parameter. The proposed vector-valued representation of pitch variation is the vanishing-point product, evaluated over a continuum of dilation values.
For each analysis window, centered at time t, we compute the short-time frequency representation of the left-half and the right-half portion of the window, leading to F_L and F_R, respectively, using two asymmetrical windows which are mirror images of each other, as shown in Figure 2.


Figure 2. Left and right windows used for the computation of F_L and F_R, respectively, consisting of asymmetrical Hamming and Hann window halves. T_0 is 4 ms, and T_1 is 12 ms, for a full analysis window width of 32 ms. A 32 ms Hamming window is shown for comparison.
F_L and F_R are N = 512-point Fourier transforms, computed every 8 ms. The peaks of the two windows are 8 ms apart. The FFV spectrum is then given by

\[
g[r] \;=\;
\begin{cases}
\dfrac{\displaystyle\sum_{k}\,\bigl|F_L[k]\bigr|\,\bigl|\tilde F_R\bigl(k\,2^{-4r/N}\bigr)\bigr|}
      {\bigl\|F_L\bigr\|\;\bigl\|\tilde F_R\bigl(\cdot\,2^{-4r/N}\bigr)\bigr\|}\,, & r \ge 0,\\[2ex]
\dfrac{\displaystyle\sum_{k}\,\bigl|\tilde F_L\bigl(k\,2^{+4r/N}\bigr)\bigr|\,\bigl|F_R[k]\bigr|}
      {\bigl\|\tilde F_L\bigl(\cdot\,2^{+4r/N}\bigr)\bigr\|\;\bigl\|F_R\bigr\|}\,, & r < 0,
\end{cases}
\]

where, in each case, summation is from k = -N/2 + 1 to k = N/2; for convenience, r varies over the same range as k. Normalization ensures that g[r] is an energy-independent representation. The frequency-scaled, interpolated values F̃_L and F̃_R are obtained by linear interpolation between the two nearest frequency bins,

\[
\tilde F_L\bigl(k\,2^{+4r/N}\bigr) \;=\; \lambda_L\,F_L\bigl[\bigl\lfloor k\,2^{+4r/N}\bigr\rfloor\bigr] \;+\; \bigl(1-\lambda_L\bigr)\,F_L\bigl[\bigl\lfloor k\,2^{+4r/N}\bigr\rfloor + 1\bigr],
\]
\[
\tilde F_R\bigl(k\,2^{-4r/N}\bigr) \;=\; \lambda_R\,F_R\bigl[\bigl\lfloor k\,2^{-4r/N}\bigr\rfloor\bigr] \;+\; \bigl(1-\lambda_R\bigr)\,F_R\bigl[\bigl\lfloor k\,2^{-4r/N}\bigr\rfloor + 1\bigr],
\]

where the interpolation weights λ_L and λ_R are determined by the fractional parts of the scaled bin indices k·2^{+4r/N} and k·2^{-4r/N}, respectively.
A sample FFV spectrum, for a voiced
frame, is shown in Figure 3; for unvoiced
frames, the peak tends to be much lower and
the tails much higher. The position of the peak,
with respect to r = 0, indicates the current rate
of fundamental frequency variation. The sample
FFV spectrum shown in Figure 3 thus indicates
a single frame with a slightly negative slope,
that is, a slightly falling pitch.
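To make the computation concrete, the following Python sketch derives such a spectrum for a single frame. The asymmetric window construction, the dilation grid and the cosine-style normalization are illustrative assumptions based on the description above, not the authors' implementation, and all names in the code are our own.

import numpy as np

def ffv_spectrum(frame, n_fft=512):
    """Minimal sketch of an FFV computation for one 32 ms analysis frame."""
    n = len(frame)
    # Mirror-image asymmetric windows (rough stand-ins for Figure 2).
    left_win = np.concatenate([np.hamming(n)[: n // 2], np.hanning(n)[n // 2:]])
    right_win = left_win[::-1]
    # Magnitude spectra of the left- and right-weighted signal.
    f_l = np.abs(np.fft.rfft(frame * left_win, n_fft))
    f_r = np.abs(np.fft.rfft(frame * right_win, n_fft))
    k = np.arange(len(f_l), dtype=float)

    r_values = np.arange(-n_fft // 2 + 1, n_fft // 2 + 1)
    g = np.zeros(len(r_values))
    for i, r in enumerate(r_values):
        scale = 2.0 ** (-4.0 * r / n_fft)
        if r >= 0:   # dot product of F_L with a dilated F_R
            a, b = f_l, np.interp(k * scale, k, f_r, left=0.0, right=0.0)
        else:        # dot product of F_R with a dilated F_L
            a, b = f_r, np.interp(k / scale, k, f_l, left=0.0, right=0.0)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        g[i] = np.dot(a, b) / denom if denom > 0 else 0.0
    return r_values, g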





Figure 3. A sample fundamental frequency variation spectrum. The x-axis is in octaves per 8 ms.

Figure 4: Filters in two versions of the filterbank. The x-axis is in octaves per second; note that the filterbank is applied to frames in which F_L and F_R are computed at instants separated by 0.008 s. Two extremity filters at (-2, -1) and (+1, +2) octaves per frame are not shown.
Filterbank
Rather than locating the peak in the FFV spec-
trum, we utilize the representation as is, and
apply a filterbank. The filterbank (FBNEW
shown in Figure 4) attempts to capture mea-
ningful prosodic variation, and contains a con-
servative trapezoidal filter for perceptually "flat" pitch ('t Hart, Collier, & Cohen, 1990); two trapezoidal filters for slowly changing pitch; and two trapezoidal filters for rapidly changing pitch. In addition, it contains two rectangular extremity filters with spans of (-2, -1) and (+1, +2) octaves per frame, as we have
observed that unvoiced frames have flat rather
than decaying tails. This filterbank reduces the
input space to 7 scalars per frame.
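A minimal sketch of such a filterbank is given below. The trapezoid breakpoints are invented for illustration, since the exact filter edges are not stated in the text; only the two extremity filters have spans given above.

import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal filter: rises from a to b, flat from b to c, falls to d."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    y[(x >= b) & (x <= c)] = 1.0
    rise = (x >= a) & (x < b)
    fall = (x > c) & (x <= d)
    y[rise] = (x[rise] - a) / (b - a)
    y[fall] = (d - x[fall]) / (d - c)
    return y

def apply_fbnew(r_oct_per_frame, g):
    """Reduce one FFV spectrum to 7 scalars, in the spirit of FBNEW."""
    x = np.asarray(r_oct_per_frame, dtype=float)
    filters = [
        trapezoid(x, -1.0, -0.6, -0.3, -0.1),       # rapidly falling pitch
        trapezoid(x, -0.3, -0.15, -0.05, -0.01),    # slowly falling pitch
        trapezoid(x, -0.05, -0.01, 0.01, 0.05),     # perceptually "flat" pitch
        trapezoid(x, 0.01, 0.05, 0.15, 0.3),        # slowly rising pitch
        trapezoid(x, 0.1, 0.3, 0.6, 1.0),           # rapidly rising pitch
        ((x >= -2.0) & (x <= -1.0)).astype(float),  # extremity filter
        ((x >= 1.0) & (x <= 2.0)).astype(float),    # extremity filter
    ]
    return np.array([np.dot(w, g) for w in filters])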
We show what a spectrogram representa-
tion looks like when FFV spectra from con-
secutive frames are stacked alongside one
another, in Figure 5, as well as what the repre-
sentation looks like after being passed through
filterbank FBNEW of Figure 4.
Modeling FFV spectra sequences
In order to transition from vectors of frame-by-
frame FFV spectra passed through a filterbank
to something more like what we normally asso-
ciate with prosody, such as flat, falling, and ris-
ing pitch movements, sequences of FFV spectra
need to be modeled. A standard option for
modeling sequences involves training hidden
Markov models (HMM). In previous work, we
have used fully-connected hidden Markov
models (HMM) consisting of four states with
one Gaussian per state (see Figure 6). However,
other HMM topologies are also possible.
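For concreteness, a sketch of such a model using the hmmlearn toolkit is shown below. The toolkit choice, the training settings and the classification-by-likelihood comment are our own assumptions for illustration; they are not a description of the toolchain used in the cited experiments.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_prosody_hmm(sequences):
    """Fully-connected 4-state HMM, one Gaussian per state.

    sequences: list of (T_i, 7) arrays of per-frame filterbank outputs.
    """
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=25)
    model.fit(X, lengths)
    return model

# Classification between, say, two prosodic sequence classes can then be
# done by comparing per-sequence log-likelihoods of two such models:
# model_a.score(seq) > model_b.score(seq).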





Figure 5. Spectrogram for a 500 ms fragment of audio (top panel, upper frequency of 2 kHz); the FFV
spectrogram for the same fragment (middle panel);
and the same FFV spectrum (bottom panel) after
being passed through the FBNEW filterbank as
shown in Figure 4.



Figure 6. A fully-connected hidden Markov model
(HMM) consisting of four states with one Gaussian
per state.

Discussion
We have derived a continuous and instantane-
ous vector representation of variation in funda-
mental frequency and given a detailed descrip-
tion of the steps involved, including a graphical
demonstration of both the form of the represen-
tation, and its evolution in time. We have also
suggested a method for modeling sequences
with HMMs and utilizing the representation in
a classification task.
Initial experiments along these lines show
that such HMMs, when trained on dialogue da-
ta, corroborate research on human turn-taking
behavior in conversations. These experiments
also suggest that the representation is suitable
for direct, principled, continuous modeling (as
in automatic speech recognition) of prosodic
sequences, which does not require peak-
identification, dynamic time warping, median
filtering, landmark detection, linearization, or
mean pitch estimation and subtraction
(Laskowski et al., 2008a, 2008b).
We expect the method to be especially use-
ful in situations where online processing is re-
quired, such as in conversational spoken dialo-
gue systems. Further experiments will test the
method in real systems, for example to support
turn-taking decisions. We will also explore the
use of the FFV spectrum in combination with
other sources of information, such as durational
patterns in interaction control.
Immediate next steps include fine-tuning the
filter banks and the HMM topologies, and test-
ing the results on other tasks where pitch
movements are expected to play a role, such as
the attitudinal coloring of short feedback utter-
ances (e.g. Edlund, House, & Skantze, 2005;
Wallers, Edlund, & Skantze, 2006), speaker ve-
rification, and automatic speech recognition for
tonal languages.
Acknowledgements
We would like to thank Tanja Schultz and Rolf
Carlson for encouragement of this collaboration
and Anton Batliner, Rob Malkin, Rich Stern,
Ashish Venugopal, and Matthias Wölfel for
several occasions of discussion. The work pre-
sented here was funded in part by the Swedish
Research Council (VR) project 2006-2172.
References
't Hart, J., Collier, R., & Cohen, A. (1990). A
perceptual study of intonation: An experi-
mental-phonetic approach to speech melo-
dy. Cambridge: Cambridge University
Press.
de Cheveigné, A., & Kawahara, H. (2002).
YIN, a fundamental frequency estimator for
speech and music. The Journal of the
Acoustical Society of America, 111(4),
1917-1930.
Edlund, J., & Heldner, M. (2005). Exploring
prosody in interaction control. Phonetica,
62(2-4), 215-226.
Edlund, J., House, D., & Skantze, G. (2005).
The effects of prosodic features on the in-
terpretation of clarification ellipses. In Pro-
ceedings of Interspeech 2005 (pp. 2389-
2392). Lisbon, Portugal.
Laskowski, K., Edlund, J., & Heldner, M.
(2008a). An instantaneous vector represen-
tation of delta pitch for speaker-change pre-
diction in conversational dialogue systems.
In Proceedings ICASSP 2008. Las Vegas,
NV, USA.
Laskowski, K., Edlund, J., & Heldner, M.
(2008b). Machine learning of prosodic se-
quences using the fundamental frequency
variation spectrum. In Proceedings Speech
Prosody 2008. Campinas, Brazil.
Laskowski, K., Wölfel, M., Heldner, M., & Ed-
lund, J. (in press). Computing the funda-
mental frequency variation spectrum in
conversational spoken dialogue systems. In
Acoustics'08 Paris. Paris, France.
Shriberg, E., & Stolcke, A. (2004). Direct mod-
eling of prosody: An overview of applica-
tions in automatic speech processing. In
Proceedings of Speech Prosody 2004 (pp.
575-582). Nara, Japan.
Wallers, Å., Edlund, J., & Skantze, G. (2006).
The effects of prosodic features on the in-
terpretation of synthesised backchannels. In
E. André, L. Dybkjaer, W. Minker, H.
Neumann & M. Weber (Eds.), Proceedings
of Perception and Interactive Technologies
(PIT'06) (pp. 183-187): Springer.



Speech technology in the European project MonAMI
Jonas Beskow¹, Jens Edlund¹, Björn Granström¹, Joakim Gustafson¹, Oskar Jonsson² & Gabriel Skantze¹
¹Centre for Speech Technology, KTH Stockholm, Sweden
²Swedish Institute of Assistive Technology (SIAT), Vällingby, Sweden


Abstract
This paper describes the role of speech and
speech technology in the European project
MonAMI, which aims at mainstreaming ac-
cessibility in consumer goods and services, us-
ing advanced technologies to ensure equal ac-
cess, independent living and participation for
all. It presents the Reminder, a prototype em-
bodied conversational agent (ECA) which helps
users to plan activities and to remember what
to do. The prototype merges speech technology
with other, existing technologies: Google Cal-
endar and a digital pen and paper. The solution
allows users to continue using a paper calen-
dar in the manner they are used to, whilst the
ECA provides notifications on what has been
written in the calendar. Users may also ask
questions such as "When was I supposed to meet Sara?" or "What's on my schedule today?"
Introduction
This paper presents the first version of a multi-
modal spoken dialogue system developed
within the European project MonAMI
(http://www.monami.info/). The objective of
the MonAMI project is to demonstrate that ac-
cessible, useful services for elderly and dis-
abled persons living at home can be delivered
in mainstream systems and platforms. The tech-
nology platforms delivering the services are
largely derived from standard technology, and
integrate elements such as wearable devices,
user interaction technology, and service infra-
structures to ensure quality of service, reliabil-
ity and privacy. The services are delivered on
mainstream devices and services such as digi-
tal-TV, cell telephones and broadband Internet.
As traditional human-machine interfaces of-
ten assume a degree of computer literacy and
are unintuitive to those unfamiliar with tech-
nology, development of innovative interfaces is
also a part of the MonAMI project. The overall
goal is to relieve human-computer interaction
from some of the demands posed on the cogni-
tive, visual and motor skills of the user, espe-
cially for elderly and disabled persons. Conver-
sational interfaces are a radically different ap-
proach to human-machine interaction where the
interaction metaphor is shifted from desktop
manipulation to spoken dialogue, modelled on
communication we are intrinsically familiar
with: human-human face-to-face spoken dia-
logue. The result is an ECA an embodied
conversational agent, communicating with
speech, facial expression, gaze and gesture.
The innovative interfaces effort within
MonAMI aims to develop interface technology
based on the ECA; to implement a prototype
that will be evaluated with users in the target
group; and to adapt and use existing design and
evaluation methods, based on end user in-
volvement, for gaining understanding of IT
functions and services that are considered
meaningful by people with disabilities and peo-
ple close to them. This demo paper presents the
first version of the Reminder, the prototype
ECA developed in the project in order to reach
these goals.
The task
The choice of target application for an ECA
prototype was informed by the services allo-
cated for the Swedish FU centre (a Feasibility
and Usability centre where user tests are held
in lab-like conditions) in MonAMI, and in par-
ticular by meetings held with two potential us-
ers, both of whom have had a brain tumour and
have cognitive disability, to identify potential
areas addressing real key problems in their
daily life. Based on these interviews, the choice
fell on an application helping users plan activi-
ties and remember what to do. The overall ap-
plication design is largely based on require-
ments from the interviews with the potential
users, both of whom used a range of reminder
applications and devices: paper calendars, pa-
per notes, PDA calendars, electronic white-
boards, and SMS notifications, and both of
whom expressed interest in using an ECA for
getting notifications. The reminder addresses
this by supporting pen input as well as speech,
as seen in the following scenario:



December 14
10:00  When speaking to Sara on the phone, Peter and Sara agree on a meeting at 12:00 the next day. Peter writes this down in his calendar.

December 15
8:00   Peter: What happens today?
       System: At 12 o'clock you have written "meeting with Sara".
       Peter: Ok, remind me 1 hour ahead.

11:00  System: Peter?
       Peter: Yes.
       System: At 12 o'clock, you have written "meeting with Sara".
       Peter: Ok.



The domain presents hard challenges for ECAs.
For example, the things a person may want to
be reminded of vary indefinitely, which is a
problem for speech recognition.
The technology
The Reminder application architecture is based
on the Higgins architecture (Edlund et al.,
2004). The architecture is chiefly designed to
cater to development and research needs, such
as flexibility and ease of use, and places few
constraints on components, which can be im-
plemented in any language and run on any plat-
form. Figure 1 shows the components and the
message flow in the Reminder application.
From the ASR, the top hypothesis with word
confidence scores (2) is forwarded to the natural
language understanding components. First it is
sent to the robust interpreter Pickering (Skantze
& Edlund, 2004), which makes a robust inter-
pretation of this hypothesis and creates context-
independent semantic representations of com-
municative acts (CAs). The results from
Pickering (3) are forwarded to the discourse
modeller Galatea (Skantze, in press), which
may be regarded as a further interpretation step
taking dialogue context into account. Galatea
adds these to a discourse model (DM). The dis-
course model (4) is passed to the Reminder Ac-
tion Manager, which initiates system actions.
The Reminder uses Google Calendar as its
backend. When the discourse model is updated
by a user request for calendar information, the
action manager searches Google Calendar (5)
to generate a system response in the form of a
CA (7), which is passed to a component called
Ovidius (Skantze, 2007). Ovidius generates a
textual representation of the system utterance
(8) that is forwarded to a multimodal speech syn-
thesiser for rendering (10). (The CA and the
textual representation are both passed to Gala-
tea (9) for inclusion in the discourse model.)
The text-to-speech synthesis and facial anima-
tion is responsible for producing verbal as well
as non-verbal responses from the system. The
animated character is based on a 3D parameter-
ised talking head that can be controlled by a
text-to-speech system to provide accurate lip-
synchronised audio-visual synthetic speech
(Beskow, 1997). The facial model includes
control parameters for articulatory gestures as
well as non-verbal expressions, which can be
derived from motion recordings or developed
using an interactive parameter editor (see
Beskow et al., 2005 for details).
Each time the system is initiated, or when the
Google Calendar entries are updated, the action
manager also parses the calendar entries to
build new speech recognition grammars and
send them to the ASR (6). A schematic of
Google Calendar can be seen in Figure 2, in
which the original service interfaces are repre-
sented by solid lines and the extensions imple-
mented in MonAMI by dotted lines. Utilising
Google Calendar brings the obvious advantage
of not having to provide hardware, software
and connectivity for the calendar backbone, but
there are several other benefits as well. Some of
the more noteworthy come from the fact that
the Google Calendar already provides user
APIs in the form of a Web GUI for input and
output, and SMS notifications as a form of output.

Figure 1: The Reminder architecture. Components: Daytona ASR, Pickering NLU, Galatea discourse modeller, Reminder Action Manager, Google Calendar, Ovidius NLG and multimodal speech synthesis; the numbered messages (1)-(10) correspond to the flow described in the text.
Mainly to meet the requirements from po-
tential users, and partly in order to address the
large and unknown vocabulary problem, we
designed a solution based on a mix of speech
technology and a digital pen and paper. To the
user, the effect of the pen input is that of writ-
ing down events in a seemingly ordinary paper
calendar. The text written by the user is trans-
ferred to a computer which performs handwrit-
ing recognition and transfers the information to
a calendar backbone. The information may then
be accessed by the ECA. The solution allows
the users to use a paper calendar like they are
used to, and addresses the ASR vocabulary
problem: users may write anything they like in
the calendar, but vocabulary can be limited to a base vocabulary plus the contents of calendar entries, which are used to update the vocabulary so that the user may speak about events in the calendar.
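As a rough illustration of this vocabulary-update step, the sketch below extends a base word list with the words found in calendar entry strings. The base vocabulary, the tokenization and all names are invented for the example; the actual MonAMI grammar format and the Google Calendar access are not shown.

import re

BASE_VOCABULARY = {"what", "happens", "today", "when", "was", "i",
                   "supposed", "to", "meet", "remind", "me", "ok", "yes"}

def vocabulary_from_calendar(entries, base=BASE_VOCABULARY):
    """Extend a base ASR vocabulary with words from calendar entries.

    entries: iterable of free-text calendar strings, e.g. produced by the
    handwriting recognition of the digital-pen input.
    """
    vocab = set(base)
    for text in entries:
        vocab.update(w.lower() for w in re.findall(r"[^\W\d_]+", text))
    return sorted(vocab)

# Example: vocabulary_from_calendar(["Meeting with Sara 12:00"]) adds
# "meeting", "with" and "sara" to the base vocabulary.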
Next steps
The KTH Reminder is currently being prepared
for a first set of evaluation experiments with
potential users at SIAT (Swedish Institute of
Assistive Technology/Hjlpmedelsinstitutet).
Acknowledgements
This research was carried out at the Centre for
Speech Technology, a competence centre at
KTH, supported by MonAMI, an Integrated
Project under the European Commission's
Sixth Framework Program (IP-035147).
References
Beskow, J. (1997). Animation of talking agents. In Benoit, C., & Campbell, R. (Eds.), Proc
of ESCA Workshop on Audio-Visual Speech
Processing (pp. 149-152). Rhodes, Greece.
Beskow, J., Edlund, J., & Nordstrand, M.
(2005). A model for multi-modal dialogue
system output applied to an animated talk-
ing head. In Minker, W., Bühler, D., &
Dybkjaer, L. (Eds.), Spoken Multimodal
Human-Computer Dialogue in Mobile En-
vironments, Text, Speech and Language
Technology (pp. 93-113). Dordrecht, The
Netherlands: Kluwer Academic Publishers.
Edlund, J., Skantze, G., & Carlson, R. (2004).
Higgins - a spoken dialogue system for in-
vestigating error handling techniques. In
Proceedings of the International Confer-
ence on Spoken Language Processing,
ICSLP 04 (pp. 229-231). Jeju, Korea.
Skantze, G. (2007). Error Handling in Spoken
Dialogue Systems - Managing Uncertainty,
Grounding and Miscommunication. Doc-
toral dissertation, KTH, Department of
Speech, Music and Hearing.
Skantze, G. (in press). Galatea: A discourse
modeller supporting concept-level error
handling in spoken dialogue systems. To be
published in Dybkjær, L., & Minker, W.
(Eds.), Recent Trends in Discourse and
Dialogue. Springer.
Skantze, G., & Edlund, J. (2004). Robust inter-
pretation in the Higgins spoken dialogue
system. In ISCA Tutorial and Research
Workshop (ITRW) on Robustness Issues in
Conversational Interaction. Norwich, UK.
Figure 2: Calendar interfaces (Web GUI, ECA, SMS notifications, handwriting). Dotted lines indicate the interfaces added in MonAMI.
Knowledge-Rich Model Transformations for Speaker
Normalization in Speech Recognition
Mats Blomberg, Daniel Elenius
Dept of Speech, Music and Hearing, CSC, KTH, Stockholm

Abstract
In this work we extend the test utterance adap-
tation technique used in vocal tract length nor-
malization to a larger number of speaker char-
acteristic features. We perform partially joint
estimation of four features: the VTLN warping
factor, the corner position of the piece-wise lin-
ear warping function, spectral tilt in voiced
segments, and model variance scaling. In ex-
periments on the Swedish PF-Star children da-
tabase, joint estimation of warping factor and
variance scaling lowers the recognition error
rate compared to warping factor alone.
Introduction
Mismatch between training and test data is a
major cause of speech recognition errors. Adap-
tation is one way to reduce this mismatch. If,
however, the adaptation utterances are un-
known, adaptation has to be performed in an
unsupervised manner. This is less effective and
the performance gain is not as high as for su-
pervised adaptation. One explanation to this lies
in the fact that the conventional data-driven ad-
aptation algorithms impose low constraints on
the properties of the updated models. This
makes the process sensitive to recognition er-
rors.
Another view of the mismatch problem is
that very large amounts of training data are re-
quired for covering different speaker characteristics in current state-of-the-art recognition
systems. Considerable reduction should be pos-
sible if some of these properties could be in-
serted artificially.
A hypothesis in this paper is that including
known theory on speech production in the train-
ing and adaptation procedures can provide a
solution to these problems. In adaptation, the
knowledge could be used to constrain the up-
dated models to be realistic. The second prob-
lem, missing speaker characteristics in the
training corpus, could be approached by pre-
dicting their acoustic features and inserting
them into the trained models. In this way, the
models are artificially extended to a larger
training population.
Likelihood-maximisation based estimation
of explicit speaker or environment properties
has shown to be a powerful tool for speech rec-
ognition when there is no adaptation data avail-
able (Sankar and Lee, 1996). This is performed
by optimizing a small number of parameters to
maximize the recognition score of an utterance.
The parameters control the transformation of
the acoustic features of either the incoming ut-
terance, or the trained models.
One advantage of this approach compared
to common techniques for speaker adaptation,
e. g. MAP or MLLR, is the low number of pa-
rameters to control the adaptation. If only one
parameter is used and the likelihood function is
smooth enough to allow sparse sampling,
searching over the whole dynamic range of the
parameter is practically possible. A well-known
example of this is Vocal Tract Length Normali-
zation (VTLN) (Lee and Rose, 1996), where the
effect of different length of the supra-glottal
acoustic tube is often modeled by a single fre-
quency warping parameter, which expands or
compresses the frequency axis of the input ut-
terance or the trained models. VTLN has
proven to be efficient both within adult speak-
ers and, especially, for children using models
trained on adult speakers. In the latter case with
large mismatch between training and test data,
VTLN can reduce the errors by around 50%,
(e.g. Potamianos and Narayanan, 2003, and
Elenius and Blomberg, 2005).
The objective of this paper is to investigate
a few other speech properties and study how
they can be combined with VTLN. A require-
ment for successful transformation of a particu-
lar feature is that it should raise the discrimin-
ability between the correct and the incorrect
identities. Furthermore, the transformation
should produce realistic models, which suggests
an approach based on speech production theory.
This paper looks into a few candidate prop-
erties. We modify phone models of an HMM-
based recogniser using transforms related to
speaker characteristics. The transformations are
evaluated in unsupervised test utterance adapta-
tion in the challenging task of recognising chil-
drens speech using models trained on adult
speech.

Studied speaker characterization
features
Vocal Tract Length
The technique for VTLN, which is used in this
work, is based on a Gaussian distribution as-
sumption and a linear transformation. In this
case, the new feature distributions of the models are obtained by multiplying the mean and co-
variance arrays by a transformation matrix. Pitz
and Ney (2005) have shown how this is done
efficiently in an MFCC (Mel Frequency Cep-
strum Coefficients) feature representation. This
technique is used in the current work. An ad-
vantage with this approach is that the transfor-
mation is applied on the models, not the input
utterance. This facilitates phoneme-dependent
warp factors, in contrast to input utterance
warping. In the latter case, the whole utterance
is normally uniformly warped, since its pho-
netic identity is unknown.
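The model-space transform itself is just a linear change of variables applied to each Gaussian, as in the sketch below. How the warping matrix A is derived from the warping factor (cf. Pitz and Ney, 2005) is not shown; the function and variable names are illustrative.

import numpy as np

def vtln_transform_gaussian(mean, cov, A):
    """Model-space VTLN as a linear transform of one Gaussian (sketch).

    For a linearly transformed variable y = A x:
        mean' = A mean,   cov' = A cov A^T
    The same matrix is applied to every mixture component of a state.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return A @ mean, A @ cov @ A.T

# Because the transform acts on the models rather than on the input
# utterance, a different matrix A can be used per phone model, which is
# what makes phoneme-dependent warp factors straightforward.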
Piece-wise linear warping function
The warping function applied in this report is a
2-segment piece-wise linear function, with two
free parameters: warping factor and the upper
warp cut-off frequency. The latter is defined as
the projection of the break point onto a line
with slope 1.0, as in HTK (Young et. al. 2005).
One motivation for optimising the cut-off fre-
quency is that it might capture some aspect of
different scale factors for individual formants.
Expanding the frequency scale makes the
highest cepstral coefficients invalid and they
have to be excluded from recognition (Blom-
berg and Elenius, 2008).
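One plausible reading of such a 2-segment warping function is sketched below; the placement of the break point relative to the cut-off frequency is an assumption in the spirit of the HTK definition, not a literal copy of HTK's code.

import numpy as np

def piecewise_linear_warp(f, alpha, f_cut, f_nyq=7600.0):
    """Sketch of a 2-segment piece-wise linear frequency warping function.

    Below the break point the frequency axis is scaled by 'alpha'; above it,
    a second linear segment maps the remaining range so that the Nyquist
    frequency is preserved.  Here the break point is placed at f_cut / alpha.
    """
    f = np.asarray(f, dtype=float)
    f_break = f_cut / alpha
    return np.where(
        f <= f_break,
        alpha * f,
        f_cut + (f_nyq - f_cut) * (f - f_break) / (f_nyq - f_break),
    )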
Model variance
Besides having shorter vocal tract, an additional
source of higher error rate for children than for
adults is their higher variability (Potamianos
and Narayanan, 2003). Due to growth during
childhood, there is large inter-speaker variabil-
ity in physical size and accordingly acoustic
properties between individuals. Differences are
also caused by the developing acquisition of
articulation skill and consistency. Intra-speaker
variability was also observed to be larger than
for adults, possibly due to less established ar-
ticulation patterns. For these reasons, adult
models are judged to be narrower than those
trained on child speech.
Score maximization of the model variances
compensates not only for this variability differ-
ence but also for the mismatch in mean values.
Accordingly, the maximum likelihood point is
not expected to correspond to the true variabil-
ity ratio between adults and children. The de-
viation from the mean values is likely to be of
more importance.
Voice source spectral tilt
Studies on the voice source of children have
found that spectral tilt differs from that of
adults (Iseli, Shue and Alwan, 2006). Compen-
sation for this effect could be performed by
modifying the parameters of a voice source
model. In this work we perform a coarse ap-
proximation by adding a bias to the mean of the
first static cepstral coefficient, C1, to voiced
sounds. The variances remain unchanged as
well as all delta and acceleration parameters.
The transform is phoneme-dependent, since
only voiced phone models are modified.
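The variance-scaling and spectral-tilt transforms are simple per-model parameter edits, sketched below. The position of C1 in the feature vector and the handling of voiced models are assumptions for illustration.

import numpy as np

def scale_variances(variances, factor):
    """Scale the model variances (diagonal covariance entries) by a factor."""
    return np.asarray(variances, dtype=float) * factor

def add_c1_bias(mean, bias, c1_index=1, voiced=True):
    """Add a bias to the mean of the first static cepstral coefficient (C1)
    for voiced phone models only; variances, delta and acceleration
    parameters are left unchanged.  The index of C1 here is an assumption
    about the feature ordering."""
    mean = np.array(mean, dtype=float, copy=True)
    if voiced:
        mean[c1_index] += bias
    return mean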
Experiment
Speech data
The task for the experiments is digit string rec-
ognition. Children's recordings were taken from the Swedish part of the Pf-Star Children's Speech Corpus (Batliner, Blomberg, D'Arcy,
Elenius and Giuliani, 2002). This consists of
198 children from 4 to 8 years old. Each child
was aurally prompted for 10 3-digit strings.
The adult speakers were taken from Spee-
Con (Großkopf, Marasek, v. d. Heuvel, Diehl
and Kiessling, 2002). The number of digits per
speaker was equal to that in Pf-Star, but the
string length varied between 5 and 10 digits.
Both corpora were recorded through the same
type of directional head-set microphone.
Training and evaluation was conducted us-
ing separate data sets of 60 speakers in both
corpora, resulting in a training and test data size
of 1800 digits, except for the childrens test
data. Its size was 1650 digits due to the failure
of some children to produce all the three-digit
strings.
A more detailed speech data description is
presented in Blomberg and Elenius (2007).
Acoustic processing
Speech was divided into overlapping segments
at a frame rate of 100 Hz, using a 25 ms Ham-
ming window. Static, delta and acceleration
features of MFCCs and normalised log energy
were computed from a 38-channel mel filter-
bank in the frequency range 0-7600 Hz. Train-
ing and transformation was performed on 18
MFCCs while testing was made after removing
the upper six coefficients as mentioned above.
Phone model specification
Speech was modelled using phoneme-level
HMMs (Hidden Markov Models) consisting of
3 states each. Word-internal triphone models
were used, where the transition matrix was
shared for all contexts of a particular phone.
The feature observation probability distribution
for each state was modelled using a GMM
(Gaussian Mixture Model) of 32 terms.
Training and recognition experiments were
conducted using HTK 3.3. Separate software
was developed for the model transformation
algorithm.
Performed experiments
The baseline experiments use the VTLN algo-
rithm implemented in HTK with the same
analysis and acoustic conditions as in the other
experiments. HTK applies VTLN on the input
utterance.
The speaker parameters were estimated for
each utterance, by applying a joint grid search
for score maximisation. The search range of
each parameter was quantized into 10 steps. A
full 4-dimensional search is not feasible due to
the extensive amount of computation required.
We have restricted this to pair-wise full search,
where one of the parameters is always the fre-
quency warping factor. The reason for always
including the latter is its proven importance for
child speech recognition using adult models.
In order to determine if a parameter needs to
be estimated for each utterance or if it should
rather be performed on a larger number of ut-
terances, we also compare the error rate be-
tween utterance and test set based optimisation
of the parameters.
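A minimal sketch of such a pair-wise grid search is given below. The score function stands in for a full recognition pass with transformed models and is not implemented here; the ranges and step count follow the description above, while all names are illustrative.

import numpy as np
from itertools import product

def pairwise_grid_search(score_fn, warp_range, other_range, n_steps=10):
    """Jointly search the warping factor and one other speaker parameter.

    score_fn(warp, other) should return the recognition score (e.g. the log
    likelihood of the best hypothesis) for one utterance, with the models
    transformed by the given parameter pair.
    """
    warps = np.linspace(warp_range[0], warp_range[1], n_steps)
    others = np.linspace(other_range[0], other_range[1], n_steps)
    return max(product(warps, others), key=lambda p: score_fn(*p))

# Example: pairwise_grid_search(score_fn, (1.0, 1.7), (1.0, 3.0)) searches
# warping factor and variance scaling, i.e. 10 x 10 recognition passes per
# utterance.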
Results
The results of the baseline and the experiments
with the proposed approach are displayed in
Table 1 and Table 2, respectively. Without any
transformation of the adult models the word
error rate is quite high, 33.0%, understandable
with regard to the low age of the children. The
standard VTLN algorithm in HTK roughly
halves the error rate to 15.2%.
Table 1. Baseline results for non-normalized train-
ing and test data.
Train:    Adult   Adult   Child
Test:     Adult   Child   Child
WER (%):  1.4     33.0    6.8

Table 2. Error rates for various combinations of speaker factor optimization. The search range for each parameter is indicated in the table head. The label "Opt" indicates that this parameter is utterance-estimated. Values denote the constant setting on the whole test set. An asterisk at a value denotes that it is a test set average during the optimisation run marked by "u". Absence of an asterisk indicates a default value.

Warping factor  Warp cut-off   Variance factor  C1 bias     WER (%)
(1.0 - 1.7)     (1000 - 7600)  (1.0 - 3.0)      (-8 - +10)
HTK             5700           -                -           15.2
Opt (u)         5700           1.0              0.0         12.8
Opt             Opt (u)        1.0              0.0         13.3
Opt             5700           Opt (u)          0.0         11.0
Opt             5700           1.0              Opt (u)     12.7
Opt             4632*          1.0              0.0         13.0
Opt             5700           1.58*            0.0         11.0
Opt             5700           1.0              -0.34*      12.7
Opt             4632*          1.58*            -0.34*      10.8
1.25*           4632*          1.58*            -0.34*      11.8

This is further reduced by model-based
VTLN transformation to 12.8%. Unexpectedly,
joint likelihood optimisation of warp factor and
warp cut-off frequency increases the error to
13.3%. Combined warp factor and variance
scaling search lowered the 12.8% error rate of
single warp factor optimisation to 11.0%, a re-
duction by 14% relative. The error rate of com-
bined C1 and warp factor optimisation differs
very little from that of a single warp factor.
Locking the parameters to their average es-
timates over the test set resulted in little differ-
ence in error rate compared to utterance optimi-
sation, except for the warping factor, which
raised the error to 11.8%.

Discussion
It is interesting to note the positive contribution
of variance scaling. A probable explanation is
that the effect of mismatch in the mean values
is reduced, rather than in the variances. In any
case, the result shows empirically that likeli-
hood maximisation of the model variances is
able to improve the recognition performance.
Optimising the warping cut-off frequency
and the chosen representation of voice source
spectral tilt did not improve the results. The
warp cut-off frequency had, in fact, slight nega-
tive impact on recognition performance. Re-
garding spectral tilt, a more detailed and accu-
rate voice source model than C1 bias should
work better. A general problem is that the trans-
formations can raise the score of incorrect iden-
tities more than that of the correct identity.
Less realistic transformations of incorrect iden-
tities are not penalised. One way to achieve this
could be to assign probabilities to the transform
parameter values.
Optimisation of the proposed speaker pa-
rameters to each utterance instead of optimisa-
tion to the whole test set turned out to be of lit-
tle value, except for the warping factor. It is
probable that speaker-specific parameter esti-
mation over more than one utterance would
have been a better choice. Phoneme-specific
settings might also be important.
Joint optimisation of the warping factor and
each of the other speaker features seems to have
had little or no advantage. However this is re-
garded as specific for the particular features
used.
Conclusions
Although the inclusion of the proposed trans-
forms result in only modest recognition im-
provement, we believe that the approach of us-
ing explicit transformations for extending the
properties of a given training population is a
promising candidate for combining knowledge
of speech production with data driven training
and adaptation. More work is required to de-
velop realistic and accurate transforms to dif-
ferent speaker characteristics and speech styles.
The results may give insight in speech relations
important for speech recognition. The demand
for efficient transformations may inspire to in-
tensified connections between the speech rec-
ognition and the phonetic-acoustic research
fields.
Acknowledgements
This work was financed by the Swedish Re-
search Council.
References
Batliner, A., Blomberg, M., D'Arcy, S., Elenius, D., Giuliani, D. (2002) The PF_STAR Children's Speech Corpus. InterSpeech, 2761-2764.
Blomberg, M. and Elenius, D. (2007) Vocal
tract length compensation in the signal and
model domains in child speech recognition.
Proc. Swedish Phonetics Conference, TMH-
QPSR, 50(1), 41-44.
Blomberg, M. and Elenius, D. (2008) Investi-
gating Explicit Model Transformations for
Speaker Normalization, ISCA Workshop on
Speech Analysis and Processing for Knowl-
edge Discovery, Aalborg, Denmark.
Elenius D. and Blomberg M. (2005) Adaptation
and Normalization Experiments in Speech
Recognition for 4 to 8 Year old Children.
Proc. Interspeech, 2749-2752.
Großkopf, B., Marasek, K., v. d. Heuvel, H.,
Diehl, F., Kiessling, A. (2002) SpeeCon -
speech data for consumer devices: Database
specification and validation. Proc. LREC.
Iseli, M., Shue, Y.-L., and Alwan, A. (2006)
Age- and gender-dependent analysis of
voice source characteristics. Proc. ICASSP,
389-392.
Lee, L. and Rose, R. C. (1996) Speaker Nor-
malization using efficient frequency warp-
ing procedures. Proc ICASSP, 353-356.
Pitz, M. and Ney, H. (2005) Vocal Tract Nor-
malization Equals Linear Transformation in
Cepstral Space. IEEE Trans. Speech and
Audio Proc. Vol. 13, No. 5.
Potamianos A. and Narayanan S. (2003) Robust
Recognition of Children's Speech. IEEE Trans. Speech and Audio Proc., 603-616.
Sankar, A. and Lee, C.-H. (1996) A Maximum-
Likelihood Approach to Stochastic Match-
ing for Robust Speech Recognition, IEEE
Trans. Speech and Audio Proc., Vol. 4, No.
3.
Young, S., Evermann, G., Gales, M., Hain, T.,
Kershaw, D.,Moore, G., Odell, J., Ollason,
D., Povey, D., Valtchev, V., Woodland, P.
(2005) The HTK book. Cambridge Univer-
sity Engineering Department.
Development of a southern Swedish clustergen voice
for speech synthesis
Johan Frid
Centre for Languages and Literature, Lund University
Abstract
This paper describes the development of a
speech synthesis voice with a southern Swedish
accent. The voice is built for the Festival speech
synthesis system using the tools in the festvox
suite. The voice type is clustergen, which is a
statistical-parametrical synthesis method where
parametrical models for phonemes, duration
and pitch all are built from a labeled speech
database.
Introduction
In recent years, much of the progress within the
field of speech synthesis has come within the
concatenative paradigm. Corpus-based methods
with speech material collected from several
thousands of utterances have been dominating
the field. This method has reached a high level
of naturalness and is widely used in commer-
cial systems today. These systems have a draw-
back though; they are limited in voice flexibil-
ity. Therefore, a recent development is to use
corpus methods within parametric synthesis as
well. Several techniques have emerged under
the name of statistical-parametrical synthesis
methods. The goal of these methods is to com-
bine the flexibility of parametric synthesis, thus
allowing variation in the voice source and the
prosody, with the robustness of corpus-based
methods.
Statistical-parametrical synthesis meth-
ods
The method we will use in this work is called
clustergen. The clustergen synthesis method
was developed by Alan Black (Black 2006; Black, Zen & Tokuda 2007). The basic idea is to represent speech as MFCCs (Mel Frequency Cepstral Coefficients), then generate the mean of a number of similarly sounding speech segments and finally resynthesize speech using MLSA (Mel Log Spectrum Approximation, Imai 1983).
Another branch within statistical-parametrical synthesis methods is HMM-based synthesis, e.g. HTS by Tokuda, Zen and Black (2002). An HMM-based system was developed for Swedish by Lundgren (2005).
Developing a clustergen voice
In this section, we describe the different steps
involved in building a clustergen voice. This in-
volves corpus collection and preparation, re-
cording the prompts (sentences) in the corpus, autolabelling the corpus, and finally the actual voice building process.
Corpus development
The clustergen voice building process needs a database with good phonetic coverage. It is also favorable if the sentences to be read do not contain too many uncommon words and are otherwise easy to read. A procedure for finding suitable and phonetically balanced sentences is described in Kominek and Black (2003). The key idea is to, rather than to make up sentence after sentence and in the end hope that you get it right, start with lots of text material and have an automatic procedure look for the right things among your sentences.
The first thing to do is to collect a sufficiently large body of text. We selected the Swedish Wikipedia encyclopedia, a version dating from 2007-07-25. This consists of about 600 MB of xml formatted data. After some processing, involving removing tags, captions, headers and more, about 600000 sentences remained. This was then reduced down to around 600 sentences using a festvox script that applies the following criteria (a rough sketch of such a selection filter is given after the example sentences below):
each sentence should consist of 4-10 words
each word should be among the 5000 most frequent
avoid all pictographic characters (only letters, periods and comma were allowed)
maximize phonetic coverage by including as many different two-letter sequences as possible
Here are some example sentences:
Aristoteles ansåg att människor av naturen är politiska varelser.
Dessa fynd gjordes i Afrika, Asien, Europa och Nordamerika
Karl Gerhard föddes som Karl Emil Georg Johnson
Resten av sträckan till Sankt Petersburg är vanlig landsväg
Säsongen blev mycket framgångsrik och laget vann Stanley Cup
Efter fem månader stod tyskarna utanför Moskva.
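The sketch below implements a greedy selection filter in this spirit. The word-count limits, the character whitelist and the coverage criterion are illustrative; the actual festvox script differs in detail.

import re

def select_prompts(sentences, top_words, n_prompts=600,
                   min_words=4, max_words=10):
    """Greedy prompt selection maximising two-letter sequence coverage."""
    def ok(s):
        words = s.lower().split()
        return (min_words <= len(words) <= max_words
                and all(w in top_words for w in words)
                and re.fullmatch(r"[a-zåäöé ,.]+", s.lower()) is not None)

    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    candidates = [s for s in sentences if ok(s)]
    covered, chosen = set(), []
    for _ in range(n_prompts):
        # Pick the sentence that adds the most new two-letter sequences.
        best = max(candidates, key=lambda s: len(bigrams(s) - covered),
                   default=None)
        if best is None:
            break
        chosen.append(best)
        covered |= bigrams(best)
        candidates.remove(best)
    return chosen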
Recording
The sentences were recorded in a quiet office with the door closed, using a rather standard headset microphone connected to a laptop. By using a headset the distance from the microphone to the mouth, and hence the recording level, was kept constant. All the sentences were recorded in one session with a very short break about every 50 sentences. For optimal pitch analysis EGG recordings would be desirable. Additional reduction of noise levels would have been achieved in an anechoic chamber. However, the resulting sound quality was found to be sufficient, at least for the research purposes targeted here.
Automatic labeling
The database must be phonemically labeled.
This can be done fully automatically if you
have a pronunciation lexicon for the words in
your sentences.
Defining the phoneset
In order to develop a lexicon, we first need to
develop a phone inventory or phone set. Southern Swedish differs from standard Swedish in that retroflexes rarely occur. Otherwise, the phone set included all regularly occurring phonemes in southern Swedish with the addition of a few xenophones (Eklund and Lindström 2001). Here is a summary:
nine long and nine short vowels
schwa. This is sometimes used in final unstressed syllables
consonants: [p t k b d g m n f s h v j l ]
front and back r. Southern Swedish normally has a back r, but some words were foreign place names, which often are pronounced with a front r
w, also since a few words have English origins
Lexicon development
Earlier speech synthesis work at the department has used the CTH lexicon (Hedelin, Jonsson and Lindblad 1987), but for the current project we decided not to use it for the following reasons:
it is not targeted for southern Swedish
it has a restricted license
Instead, work was started to develop an in-house lexicon from scratch. The 600 sentences contained about 1600 different unique words so the task was not overwhelming. Here are some example entries from the pronunciation lexicon:
(första (f oe4 r s t a))
(föddes (f oe3 d e2 s))
(får (f aoF r))
(finland (f i4 n l a n d))
(där (d ae r))
(delar (d eF3 l a2 r))
(båda (b aoF3 d a2))
(bland (b l a n d))
We use a simple phonetic alphabet where only ASCII characters are allowed in pronunciation entries. This is because it's easier to enter these characters and keeps things simple for the computer.
In the example above, the pronunciation entries are not syllabified but this is done automatically later. Prosodic information about vowel length, stress position and word accent is included. The F indicates a long vowel, otherwise all vowels are short by default. The numbers mean:
4: main stress, accent 1
3: main stress, accent 2
2: secondary stress
In monosyllabic words, prosody information is redundant as these are always stressed on the final syllable and have word accent 1.
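A small parser for entries in this format is sketched below. The interpretation of the length and stress markers follows the examples above; it is an illustration, not the festival lexicon specification.

import re

PHONE_RE = re.compile(r"^([a-z]+?)(F?)([234]?)$")

def parse_entry(entry):
    """Parse a line such as "(första (f oe4 r s t a))" into the word and a
    list of (phone, is_long, stress_marker) tuples."""
    word, phone_str = re.match(r"\((\S+) \((.+)\)\)", entry).groups()
    phones = []
    for p in phone_str.split():
        base, length, stress = PHONE_RE.match(p).groups()
        phones.append((base, length == "F", stress or None))
    return word, phones

# parse_entry("(första (f oe4 r s t a))") ->
#   ("första", [("f", False, None), ("oe", False, "4"), ("r", False, None),
#               ("s", False, None), ("t", False, None), ("a", False, None)])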
Note that in southern Swedish, compound words may retain accent 1 if the first element of the compound is an accent 1 word (Bruce 1974, Frid 2003). This is different from other varieties of Swedish, where compound words almost invariably have accent 2. Since the lexicon is targeted towards southern Swedish, compound words with accent 1 are given a pronunciation entry with accent 1. This is easy to modify for other varieties of Swedish, where the main stress accent 1 indicator '4' could be changed into the main stress accent 2 indicator '3'.
The parentheses structure seen in the example is the normal format used for festival lexicons.
Doing the labeling
The actual labeling is done through forced
alignment. For each sentence, the pronunciation
of each word is looked up. This results in a
phoneme string. The phoneme strings are then
aligned with the utterances using the EHMM
procedure in the festvox package.
Building
The voice is constructed by building decision-
tree based models from data. Each phoneme is divided into three states in order to handle coarticulation effects. For each state, trees are built for prediction of:
MFCC
F0
duration
In the trees, features such as phonetic context,
syllable structure and word position are used.
The whole building process is automated and
done with tools provided in the festvox pack-
age.
Resynthesis
The resynthesis process works as follows: The NLP component produces a phoneme string, where each phoneme again is divided into three phone states. Duration is produced by the duration tree. As we now have temporal information, we can step through the utterance at an interval of, e.g., 5 ms and at every n:th millisecond predict F0 and MFCC parameters from the phone state that is 'active' at the current time frame. Resynthesis is then done through MLSA.
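The parameter generation loop can be sketched as follows. The decision-tree models are represented by stand-in objects with a predict method, the prediction context is simplified to a time index, and the final MLSA filtering is not shown; all names are illustrative.

def parameter_track(states, frame_ms=5):
    """Step through phone states at a fixed frame interval and collect the
    F0 and MFCC targets predicted by the 'active' state's trees.

    states: list of (duration_ms, f0_model, mfcc_model) tuples.
    """
    f0_track, mfcc_track = [], []
    t, state_idx, state_end = 0, 0, states[0][0]
    total = sum(d for d, _, _ in states)
    while t < total:
        if t >= state_end:                # advance to the next active state
            state_idx += 1
            state_end += states[state_idx][0]
        _, f0_tree, mfcc_tree = states[state_idx]
        f0_track.append(f0_tree.predict(t))
        mfcc_track.append(mfcc_tree.predict(t))
        t += frame_ms
    return f0_track, mfcc_track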
Results
Included in the festvox tools is a script to produce some numerical measurements based on comparisons of synthesized utterances with real utterances. Here are the results:

all    mean 1.78   std 55.09
F0     mean 8.76   std 261.55
noF0   mean 0.3    std 0.79
MCD    mean 7.62   std 5.77

The numbers give the mean difference for all features in the parameter vector, for F0 alone, for all but F0, and MCD (mel cepstral distortion).
Notes on building a voice in a new
language
The following steps are needed to build a voice in a new language:
develop a corpus (~500 prompt sentences)
record the prompts
develop a phoneset and a phonetic lexicon for the words in the corpus
decide on a prosodic model. The default model only differs between stressed and unstressed syllables, but for Swedish we need to handle word accents.
The rest of the process is done through tools in the festvox package. The corpus processing can take some time, especially if you have a large material. Recording can be done in less than a day. The lexicon development can be tedious, but something like 500 words a day is possible. The rest of the voice building is more or less automatic. The labeling takes lots of time, the voice building a little less. However, once the voice is built it can be used instantaneously; it is as fast as any festival voice.
Improvements
The procedure described above is enough to
give you a working voice. However, the quality of the voice can be improved in several ways. One easy way is to use more sentences. This will increase the material used in the model building. Another thing is to recheck the labels. The automatic labeling process works rather well; however, there is room for some improvement. We have found instances where the lexicon contains a full pronunciation form but the prompt recording contains a reduced pronunciation. Since the labeling works by forced alignment this may introduce errors. It would also be
interesting to explore more elaborate prosodic
modeling, for instance feet structure. Currently
there is no support for this in festival.
References
Black, A. (2006), CLUSTERGEN: A Statistical
Parametric Synthesizer using Trajectory
Modeling, Interspeech 2006 - ICSLP, Pitts-
burgh, PA.
Black, A., Zen, H., and Tokuda, K., (2007)
Statistical Parametric Synthesis, ICASSP
2007, Hawaii.
Bruce, G., (1974) Tonaccentregler för sammansatta ord i några sydsvenska stadsmål. In Platzack, C., editor, Svenskans beskrivning, number 8, pages 62-75.
Eklund, R., and Lindström, A., (2001) Xenophones: An Investigation of Phone Set Expansion in Swedish and Implications for Speech Recognition and Speech Synthesis. Speech Communication 35, vols. 1-2, pp. 81-102.
Frid, J., (2003) Lexical and Acoustic Modelling
of Swedish Prosody, Department of Lin-
guistics and Phonetics, Lund University.
Hedelin, P., Jonsson, A., and Lindblad, P., (1987) Svenskt uttalslexikon: 3 ed. Tech Report, Chalmers University of Technology.
Imai, S., (1983) Cepstral analysis/synthesis on the mel frequency scale, in ICASSP-83, Boston, MA, 1983, pp. 93-96.
Kominek, J. and Black, A. (2003) CMU ARC-
TIC databases for speech synthesis CMU
Language Technologies Institute, Tech Re-
port CMU-LTI-03-177
Lundgren, A. (2005) HMM-baserad talsyntes.
Master's Thesis.
Tokuda, K., Zen, H., and Black, A. (2002) An
HMM-based speech synthesis system ap-
plied to English, Proc. of 2002 IEEE SSW,
Sept. 2002.

The perception of English consonants by Norwegian lis-
teners: A preliminary report
Wim A. van Dommelen
Department of Language and Communication Studies, NTNU, Trondheim

Abstract
This study is part of a multilingual project¹ investigating native and non-native perception of
English consonants. In the present investiga-
tion, VCV syllables presented in quiet and in
different types of noise were identified by Nor-
wegian listeners. The results showed that even
in quiet not all consonants were recognized
correctly. Consonant confusions can be inter-
preted as caused by phonological, spectral as
well as orthographic factors. Different types of
noise appeared to affect identification to differ-
ent degrees. The data suggest that noise-
specific impact is explainable in terms of ener-
getic vs. informational masking.
Introduction
This study is part of the multilingual Consonant
Challenge project investigating consonant iden-
tification by native and non-native humans and
machines (Cooke and Scharenborg, 2008). The
goal of the present paper is to look into conso-
nant identification by Norwegian second lan-
guage (L2) users of English. The consonant
systems of the two languages differ in several
respects, in particular in terms of the phoneme
inventory (e.g., Davidsen-Nielsen, 1975). Eng-
lish consonants lacking Norwegian counter-
parts are the fricatives //, //, /z/ and //. Fur-
ther, the affricates //and // dont belong to
the Norwegian phonological system. Also,
there is no approximant /w/. An important re-
search question is therefore how Norwegian
natives accommodate the new L2 sounds. Will
they be established as new phoneme categories
and what kind of assimilation will take place
(Flege, 1995; Best, 1995)? The present paper is
a first step to shed some light on these issues.
Method
Speech material and speakers
For the Consonant Challenge project, native
speakers of English were recorded producing
VCV syllables, where C was one of 24 conso-
nants (see Figure 1) and the two vowels were
/i:/, // or /u:/ in all possible combinations (for
details, see Cooke and Scharenborg, 2008). The
present investigation used material from four
male and four female talkers in seven different
sets. While in the first set the speech material
was presented in quiet, the other six contained
concurrent noise (Table 1). Competing talker
and 3-speaker babble noise can be expected to
have not only an energetic but in particular also
an informational masking effect. The other
types of noise represent different forms of en-
ergetic masking. Signal-to-noise ratios were
chosen to achieve similar overall listener per-
formance and to avoid ceiling effects. Note that
this implies limitations for inter-set compari-
sons. Each of the seven sets contained two to-
kens of each of the 24 consonants from each
speaker (2 x 24 x 8=384 VCV syllables).
Table 1. Noise conditions.
Set  Noise type  SNR (dB)
1 quiet
2 competing talker -6
3 8-speaker babble -2
4 speech-shaped noise (SSN) -6
5 factory noise 0
6 modulated SSN -6
7 3-speaker babble -3
Listening tests
A group of 21 Norwegian subjects aged 19 - 31
years (mean 23.0 years) participated as paid
listeners. They were all students at NTNU
without reported hearing problems and none of
them studied English. The majority of them had
no or almost no training in phonetics.
Individual listeners were presented with the
stimuli sitting in the departments sound-treated
studio. They identified the consonants by click-
ing on orthographic symbols like B, G, DH and
ZH presented on a computer screen. The con-
sonant symbols appeared on buttons together
with an example word (for the above-
mentioned symbols Bee, Guard, oTHer and
meaSure). For each of the listeners, the order of
both the stimuli and the sets with the noise
conditions was randomized. The only exception
to this was the quiet condition, which was al-
ways run first. Preceding the listening test sub-
jects received an oral instruction and went
through a training set with 72 (3 different in-
stances of 24) stimuli in quiet. In this training
set, but not in the actual test, feedback was
given with the possibility to replay a stimulus.
Results
Identification in quiet
As is apparent from the results for the quiet con-
dition, consonant identification was not always
a trivial task. Overall listener rates for this con-
dition amounted to 82.0 % correct, individual
results ranging between 46.2 % and 94.5 %
(standard deviation 9.4 %). For their group of
23 native English subjects, Cooke and Scharen-
borg (2008) report a mean of 93.8 % correct.
The natives thus clearly outperformed the non-
natives. It should be noted, however, that five
English listeners have been excluded by Cooke
and Scharenborg; two of them because they
failed to reach a criterion of 85 % correct in the
training session with stimuli in quiet.
To investigate the role of type of consonants
and single consonants in identification, confu-
sion tables were compiled. From the confusion
data presented in Figure 1 (top) it can be seen
that identification scores for the liquids (/l/, /r/),
plosives (/p/, /t/, /k/, /b/, /d/, /g/) and nasals
(/m/, /n/, /ŋ/) were high (96.1 %, 95.8 % and 92.7 %, respectively). Substantially lower rates were found for the affricates (/tʃ/, /dʒ/), glides (/y/, /w/) and fricatives (/f/, /v/, /θ/, /ð/, /s/, /z/, /ʃ/, /ʒ/, /h/) with 73.8 %, 73.7 % and 69.7 %
correct, respectively.
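As an illustration of how such confusion tables and per-category scores can be derived from the raw responses, the following sketch (not the analysis code actually used in the study; all names are assumptions) builds a confusion table from (spoken, heard) pairs and computes a mean percent-correct for a group of consonants:

```python
from collections import defaultdict

def confusion_table(responses):
    """Count (spoken, heard) consonant pairs into a nested table."""
    table = defaultdict(lambda: defaultdict(int))
    for spoken, heard in responses:
        table[spoken][heard] += 1
    return table

def percent_correct(table, category):
    """Mean percentage of correct identifications for a set of consonants."""
    hits = sum(table[c][c] for c in category)
    total = sum(sum(table[c].values()) for c in category)
    return 100.0 * hits / total if total else 0.0

# hypothetical responses: the velar nasal misheard as /g/ in one case
responses = [("m", "m"), ("n", "n"), ("ng", "g"), ("ng", "ng")]
table = confusion_table(responses)
print(percent_correct(table, ["m", "n", "ng"]))  # 75.0
```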
From the above-mentioned three groups
with highest identification rates, only the 84 %
rate for /ŋ/ seems to be exceptionally low. The confusion with /n/ (6 %) can be explained by spectral similarities of the two nasals. More often the velar was identified as /g/. This is presumably due to variation in production by the English speakers, who in some cases realized intended /VŋV/ as /VŋgV/. It seems natural that
in those cases listeners were not sure how to
respond and chose the velar plosive.
Recognition rates for the two affricates /tʃ/ and /dʒ/ were 82 % and 66 %. Though the former is usually not considered to belong to the Norwegian phoneme inventory (Kristoffersen, 2000), it occurs in dialects for orthographic <k>. The confusion with /ʃ/ (5 %) is probably due to listeners splitting up the affricate into its plosive and fricative components. The 5 % /ʒ/ responses for the voiced affricate may be explained by similar reasoning. This voiced sound does not occur in Norwegian, which is reflected in the confusion with its voiceless cognate (11 %). In addition to phonological factors, orthography also seems to have had an impact on response behavior (cf. Detey and Nespoulous, 2008). In 9 % of the cases, /dʒ/ was identified as /g/. This is obviously a confusion with the orthographic symbol <g>, which often represents the phoneme /dʒ/ in English writing.
Since the Norwegian consonant system is
lacking a glide /w/, the relatively low score
(79 %) for this sound is understandable. Also,
the choice of the /v/ in 13 % of the cases seems
natural in the light of phonetic similarity. Fur-
ther, it is reasonable to assume that the 28 % identification of /y/ as /dʒ/ is due to orthographic confusion. It looks like listeners correctly perceived the glide (in Norwegian writing represented with the symbol <j>) but incorrectly chose the <J, Jar> button as response.
As mentioned above, lowest identification
rates were found for the category of fricatives.
Here, the combined influence of various factors
can be postulated. First, in the Norwegian pho-
nological system the voicing opposition is far
less utilized than in English. Neither /s/ nor /ʃ/ has a voiced counterpart, and the same is true for the phonetically occurring affricate /tʃ/. Second, though Norwegian has the pairs /f/ vs. /v/ and /ç/ vs. /j/, the opposition is phonetically in particular one of fricative vs. approximant. Further, Norwegian does not have any dental fricatives (/θ/ and /ð/). Also, fricative identification might have been affected by problems of mapping sounds onto writing symbols. This could especially be true for voiced /ð/, which was represented as <TH, oTHer> in the listening test.
As can be seen from Figure 1, the voicing
distinction generally caused identification prob-
lems for the listeners, cf. recognition rates of
18 % /ð/ for /θ/, 24 % /θ/ for /ð/, 17 % /z/ for /s/, 18 % /s/ for /z/, 14 % /ʒ/ for /ʃ/ and 24 % /ʃ/ for /ʒ/. The latter two were also identified as 10 % /tʃ/ and 17 % /dʒ/, respectively, which reflects the general phenomenon of confounding single fricatives and affricates. In addition, identification of place of articulation was a challenge, in particular for the labiodental vs. dental fricatives. It should be noted that Cooke and Scharenborg (2008) observe rather high error rates in native recognition of /ð/ (42 %)
and /θ/ (21 %). They speculate that poor orthographic-phonemic correspondence could be part of the problem.

Figure 1. Top: Confusion matrix for consonant perception in quiet. Bottom: Mean recognition rates across six types of noise. Values rounded to the nearest percent. 21 listeners; 100 % = 336 judgments. Rows: spoken; columns: heard.
Identification in noise
Recognition scores averaged across all six
noise conditions are presented in Figure 1 (bot-
tom). Confusion patterns appear to be similar to
those found for the quiet condition. The hierar-
chy is here liquids (68.9 %), plosives (67.4 %),
affricates (65.3 %) followed by nasals (58.2 %),
glides (56.1 %) and fricatives (55.1 %).
Overall identification scores for the various
noise conditions are presented in Table 2. Com-
paring the different types of noise it should be
kept in mind that signal-to-noise ratios differed
for the six test sets (Table 1). From Table 2 it
can be seen that competing talker (set 2) and
speech modulated noise (set 6) had similar ef-
fects on identification scores (65.8 % vs.
67.0 % correct). This parallels the findings in
Cooke and Scharenborg (2008) and confirms
their conclusion that informational masking (set
2) is not a major factor for the competing
speech masker in the present VCV recognition
task.
Table 2. Overall scores for seven noise conditions.
Set Noise type Score (%)
1 quiet 82.0
2 competing talker 65.8
3 8-speaker babble 63.4
4 speech-shaped noise (SSN) 59.2
5 factory noise 52.2
6 modulated SSN 67.0
7 3-speaker babble 56.3
Lowest scores were found for factory noise
(52.2 %) in spite of its SNR being least severe
(0 dB). This can be explained as being due to
the spectral characteristics of this type of noise.
Also, it seems to confirm the generally strong
impact of energetic masking on speech percep-
tion. It is noteworthy that the noise conditions
8-speaker and 3-speaker babble had differently
detrimental effects on recognition in spite of
similar SNRs (-2 dB and -3 dB). The lower rate
for the latter type of noise (56.3 % vs. 63.4 %
for the former) can be speculated to reflect an
informational masking effect: Whereas it is
possible to recognize the speech of single
speakers in the 3-speaker babble, this is most
often virtually impossible in the 8-speaker bab-
ble. Due to factors that should be investigated
in the future, the present N-talker effect is at odds with Simpson and Cooke (2005), but in line with Cooke and Scharenborg (2008).
Conclusion
The results from the present study suggest that
consonant recognition by L2 users of English is
affected by differences between phonological
systems, spectral similarities of sounds as well
as orthography. The impact of energetic vs. in-
formational maskers is an issue that requires
further research.
Acknowledgements
I would like to thank Martin Cooke and Maria
Luisa Garcia Lecumberri for design and prepa-
ration of the listening test material.
Notes
1. Consonant Challenge, Interspeech 2008. Or-
ganizers and main coordinators Martin Cooke
(University of Sheffield, UK), M. Luisa Garcia
Lecumberri (University of the Basque Country,
Spain) and Odette Scharenborg (Radboud Uni-
versity Nijmegen, The Netherlands).
References
Best C.T. 1995. A direct realist view of cross-
language speech perception. In Strange W.
(ed) Speech Perception and Linguistic
Experience: Issues in Cross-Language
Research, 171-203. Timonium: York Press.
Cooke M.P. and Scharenborg O. (2008) The
Interspeech 2008 Consonant Challenge.
Submitted to Interspeech 2008.
Davidsen-Nielsen N. (1975) English Phonetics.
Oslo: Gyldendal Norsk Forlag.
Detey S. and Nespoulous J.L. (2008) Can or-
thography influence second language syl-
labic segmentation? Lingua 118, 66-81.
Flege J.E. (1995) Second language speech
learning: Theory, findings, and problems. In
Strange W. (ed) Speech Perception and
Linguistic Experience: Issues in Cross-
Language Research, 233-277. Timonium:
York Press.
Kristoffersen G. (2000) The Phonology of
Norwegian. Oxford: Oxford University
Press.
Simpson S. and Cooke M. P. (2005) Consonant
identification in N-talker babble is a non-
monotonic function of N. Journal of the
Acoustical Society of America 118, 2775-
2778.
Reaction times in the perception of quantity in Icelandic
Jrgen L. Pind
Department of Psychology
University of Iceland, Reykjavk

Abstract.
Icelandic has a contrast of vowel and conso-
nant quantity. Previous research has shown
that the ratio of vowel to rhyme duration is the
major cue for quantity, with vowel spectrum
additionally playing a role in the three central
vowels of Icelandic, among them [ɛ]. This pa-
per describes an experiment in which reaction
times are measured as well as identification
judgments for three stimulus continua, one con-
taining the vowel [a], two having the vowel [ɛ]
with different vowel quality, suggesting either a
long or a short vowel. It was hypothesized, and
experimentally confirmed, that reaction times
are longer for the continuum having the vowel
[a]. In that case the listener is forced to base
his or her judgment of quantity on both the
vowel and the following consonant.
Quantity in Icelandic
Icelandic has a contrast of quantity, having both
long and short vowels and consonants (Einars-
son, 1927; Pind, 1986). Quantity in closed syl-
lables in Icelandic is complementary in that a
long vowel is followed by a short consonant
and vice versa (Benediktsson, 1963). Previous
experiments have shown that two cues are of
paramount importance for the perception of
quantity in Icelandic. The primary cue is the
ratio of vowel to rhyme duration (Pind, 1995)
which functions as a higher-order invariant
for the perception of quantity (Gibson, 1959).
This is a temporal speech cue since it is defined
by the durations of the speech segments. Inter-
estingly, this cue is invariant, or nearly so, over
transformations of speech rate. The other cue is
spectral. Icelandic has eight vowel monophthongs; three of these, the central vowels (among them [ɛ]), are spectrally appreciably different in their long and short versions.
Previous perception experiments with the
vowel [ɛ] have shown that the perception of
quantity is heavily influenced by the spectrum
of the vowel (Pind, 1996, 1998).
The fact that for some vowels two speech
cues are available to cue the quantity contrast
suggests that the temporal course of quantity
perception might be different from those cases
where only one cue (the ratio cue) is avail-
able. In particular, it may be hypothesized that
in those cases where a spectral cue is available
a quantity decision could be reached faster than
in those cases where the durational cue is the
only cue available. In the latter case it is pre-
sumably necessary for the perceiver to wait un-
til the end of the syllable rhyme to establish the
quantity; in those cases where a spectral cue is
also available it may be hypothesized that a de-
cision as to the nature of quantity may be
reached at an earlier point in the syllable.
Experiment 1
The purpose of the following experiment is to
investigate the temporal course of quantity per-
ception using reaction times as well as catego-
rization judgments. Words with two vowels are
used, namely the vowels [a] and [ɛ].
Method
Participants
The author and nine undergraduates at the Uni-
versity of Iceland participated in the experi-
ment. All reported normal hearing.
Stimuli
Three synthetic continua were used in the ex-
periment, made with the Sensyn speech synthe-
sizer, a version of the Klatt (1980; Klatt &
Klatt, 1990) synthesizer. Two of the continua
were segg-sek continua, one a sagg-sak continuum. The word sek [sɛːk] is the female form of the word sekur, 'guilty'; segg [sɛkː] the accusative of seggur, 'man'. The word sak [saːk] is the stem of the verb saka, 'accuse', while sagg [sakː] is the stem of the word saggi, 'dampness'. The two segg-sek continua were
earlier used in a published experiment, full de-
tails of the synthesis parameters are given in
Pind (1998). The stimuli were originally mod-
eled on spoken tokens of the words. Briefly, all
stimuli started with a 100 ms fricative segment
corresponding to the s, followed by a 400 ms
syllable rhyme consisting of a vowel having a
variable duration from 108 to 264 ms and fol-
lowed by a closure, also of variable duration
from 292 to 136 ms. [Thus 108 + 292 = 400 ms
and 264 + 136 = 400 ms.] In all there were 14
stimuli in each continuum, made by lengthen-
ing the vowel in 12 ms steps and shortening the
closure by the same amount. The final release
burst of the closure had a duration of 30 ms.
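The duration structure of the continua can be summarized in a few lines of code; the sketch below simply enumerates the 14 vowel/closure pairs implied by the figures above. It is an illustration, not the original stimulus-generation script, and the constant names are my own.

```python
FRICATIVE_MS = 100   # initial [s] segment
RHYME_MS = 400       # vowel + closure always sum to 400 ms
BURST_MS = 30        # final release burst
STEP_MS = 12         # vowel lengthened, closure shortened in 12 ms steps

stimuli = [(108 + i * STEP_MS, RHYME_MS - 108 - i * STEP_MS) for i in range(14)]

for vowel_ms, closure_ms in stimuli:
    assert vowel_ms + closure_ms == RHYME_MS
    total_ms = FRICATIVE_MS + RHYME_MS + BURST_MS
    print(f"vowel {vowel_ms:3d} ms, closure {closure_ms:3d} ms, total {total_ms} ms")
```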
For one of the segg-sek continua, hereafter
called the e475 continuum, the steady state val-
ues of the first two formants were 475 and 1800
Hz, in the other continuum, hereafter the e510
continuum, the values were 510 and 1700 Hz.
The former values are close to those of long
vowels, the latter in the direction of a short
vowel. The steady state frequencies of F3-F5
were fixed at 2600, 3250 and 3700 Hz in both
continua.
The sagg-sak continuum was modeled on the two previous continua with the same durations of segments. The steady state values of F1-F5 were 750, 1280, 2450, 3250 and 3700 Hz.
The fundamental frequency during the
voiced portion of the stimuli was fixed at 125
Hz and the synthesizer was set to use an update
interval of 4 ms. The sampling frequency was
11.025 kHz.
Procedure
The experiment was run using the E-Prime software (Psychology Software Tools, Pitts-
burgh, PA) on an IBM Thinkpad 41p using the
built in sound-card of the computer. The ex-
periment was split in two parts: in the first part the sagg-sak stimulus continuum was tested, in the second part both segg-sek continua were
tested simultaneously. Both parts started with a
short familiarization block where two endpoint
stimuli were played five times each with ap-
propriate labels by the experimenter. Following
this came a practice block where each stimulus
was played two times in randomized order.
This was followed by five experimental blocks,
each containing two examples of each stimulus
in randomized order. The stimuli were played
at a comfortable listening level over Sennheiser
HS-530-II circumaural headphones.
Participants gave their responses by clicking
the "g" or "k" buttons on the computer keyboard, the former if they perceived the stimulus as sagg or segg, the latter if they perceived it as sak or sek. Participants were instructed to re-
spond as quickly as possible but also to respond
accurately.
Results
Figure 1 shows the identification curves for the
three stimulus continua. The upper part of the
figure shows the identifications for the sagg-
sak continuum. The lower half of the figure
shows the identification curves for the two
segg-sek continua. Phoneme boundaries were
calculated for individual participants using the
method of probits. On average the phoneme
boundary was placed at 191.4 ms in the sagg-sak continuum, at 174.3 ms in the e475 contin-
uum and at 188 ms in the e510 continuum.
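The paper does not spell out the boundary estimation procedure; one common approximation of the method of probits is to fit a cumulative-normal function to each participant's identification proportions and take its 50 % point as the boundary. The sketch below works under that assumption, with hypothetical data and scipy for the fitting; it is not the analysis code actually used.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def probit_curve(vowel_ms, mu, sigma):
    """Proportion of long-vowel ('sek'/'sak') responses as a cumulative normal."""
    return norm.cdf(vowel_ms, loc=mu, scale=sigma)

vowel_ms = np.arange(108, 265, 12)                      # the 14 continuum steps
p_long_vowel = np.linspace(0.05, 0.95, len(vowel_ms))   # placeholder proportions

(mu, sigma), _ = curve_fit(probit_curve, vowel_ms, p_long_vowel, p0=[186.0, 20.0])
print(f"phoneme boundary at {mu:.1f} ms vowel duration")
```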

Figure 1. Identification curves for the three stimulus
continua used in the present experiment. Averages
of 10 participants.
A one-way repeated measures ANOVA
shows the boundaries to be significantly differ-
ent, F(2,18) = 4.831, p < 0.05. Comparing indi-
vidual continua shows only a significant differ-
ence between the sak-sagg and the e475 con-
tinua, t(9) = 2.5851, p < 0.05. While the identi-
fication curves for the two segg-sek continua
differed in the expected direction the difference
did not reach statistical significance.
Figure 2 shows the results for the reaction
times in the identification task. Again the upper
half of the figure shows the results for the
sagg-sak continuum, the lower half of the figure the results for the two segg-sek continua.
Average reaction times for each participant's judgments of the individual stimuli were calculated as the harmonic means of the individual response times. The harmonic mean, H, of the scores x_1, ..., x_n is defined as 1/H = (1/n) Σ_{i=1}^{n} (1/x_i).
It has the useful property that it lessens the ef-
fects of outliers.
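(Python's standard library offers statistics.harmonic_mean; a direct transcription of the definition is equally short. The snippet below is only an illustration, not the script used for the analysis.)

```python
def harmonic_mean(times_ms):
    """Harmonic mean of reaction times; long outliers are down-weighted."""
    return len(times_ms) / sum(1.0 / t for t in times_ms)

# about 1113 ms, versus an arithmetic mean of about 1567 ms
print(harmonic_mean([800, 900, 3000]))
```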
The overall mean reaction times for the
three series of stimuli were 942.5 ms for the
sagg-sak continuum, 885.5 ms for the e475
continuum and 870.9 ms for the e510 contin-
uum. Overall, the reaction times for the a-series
were significantly longer than in the e-series,
t(418) = 3.0997, p < 0.01. The reaction times
for the two e-series were, however, not signifi-
cantly different, t(278) = 0.8121.
The distribution of the reaction times for the
sagg-sak continuum (upper panel of Figure 2)
shows the longest reaction times near the pho-
neme boundary, with shorter reaction times the
farther one gets away from the boundary. This
is as expected (Nooteboom & Doodeman,
1980). Thus the average RT is 851.9 ms for the
word with the 108 ms long vowel, 1024.1 ms
for the stimulus closest to the phoneme bound-
ary (having a vowel duration of 180 ms), and
928.4 ms for the stimulus at the other endpoint
having a vowel duration of 264 ms. Subjecting
the reaction times of these three stimuli to a
one-way repeated measures ANOVA shows
that they are significantly different, F(2,18) =
6.4491, p < 0.01. Pairwise comparisons show
the 108 ms stimulus to be significantly different
from the 180 ms long stimulus, t(9) = 2.9731;
p < 0.05. Other comparisons are not signifi-
cantly different, though the difference between
the 108 and 264 ms stimuli is marginally sig-
nificant, t(9) = 2.2613, p = 0.05007, two-tailed.
This fact underscores a suggestive trend
which can be noticed in the upper panel of Fig-
ure 2, namely that reaction times for the vowel
[a] seem to increase with the duration of the
vowel. This is an interesting result worth fur-
ther study since it does not follow from the the-
ory of quantity perception previously enter-
tained by the present author, namely that the
perception of quantity is relational, based on
the vowel to consonant ratio (Pind, 1986,
1995). In such a theory there would appear to
be no room for the finding that reaction times
increase with longer vowel durations.
Figure 2. Reaction times for the identification of the
stimuli used in the present experiment. Each point
shows the ordinary mean of the harmonic mean re-
action times of the individual participants, N = 10.
It is also of interest that no such trend is ap-
parent for the vowel [ɛ], as seen in the lower
half of Figure 2. The reaction times of the end-
point stimuli (with vowel durations of 108 and
264 ms) as well as the stimuli with a 180 ms
long vowel in the two e-continua were sub-
jected to a two-way repeated measures
ANOVA with continuum (e475 and e510) and
vowel duration as the factors. Neither contin-
uum, F(1,9) = 0.2352, vowel duration, F(2,18)
= 0.7801 nor their interaction F(2,18) = 0.9796
were statistically significant.
General discussion
The present author's first study of the percep-
tion of quantity in Icelandic (Pind, 1986)
showed that the ratio of vowel to rhyme dura-
tion was the primary cue to the perception of
quantity in Icelandic. This ratio has the useful
property that it is invariant, or at least very
nearly so, over changes in speaking rate. This
is, of course, highly advantageous since it
eliminates a major ambiguity which temporal
speech cues are subject to, namely the effect of
speaking rate. It has often been hypothesized
that temporal speech cues are perceived by
some kind of taking-into-account mechanism
where the perceptual system normalizes for
speaking rate (Summerfield, 1981). If the
vowel to consonant ratio is indeed invariant
over changes in speaking rate then this would
seem to do away with the necessity of postulat-
ing such a mechanism of normalization.
Later studies (Pind, 1996, 1999) showed
that the perception of quantity in Icelandic is
more complex than originally hypothesized
since it turned out that the spectrum of some vowels, in particular of the vowel [ɛ], can also in-
fluence the perception of quantity. This finding
suggests that reaction times for quantity judg-
ments in words containing the vowel [ɛ] would
be shorter than to words containing the vowel
[a], as confirmed in the present study. Unex-
pectedly, the present experiment also suggests
that there is a difference in reaction times to
long and short tokens of the vowel [a] with
shorter reaction times to words with short vow-
els. Further experiments investigating this issue
are in preparation.
References
Benediktsson, H. (1963). The non-uniqueness of phonemic solutions: Quantity and stress in Icelandic. Phonetica, 10, 133–153.
Einarsson, S. (1927). Beiträge zur Phonetik der isländischen Sprache. Oslo: A. W. Brøgger.
Gibson, J. J. (1959). Perception as a function of stimulation. In Koch, S. (ed) Psychology: A study of a science. Volume I. Sensory, perceptual and physiological formulations, 456–501. New York: McGraw-Hill Book Company.
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67, 971–995.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87, 820–857.
Nooteboom, S. G., & Doodeman, G. J. N. (1980). Production and perception of vowel length in spoken sentences. Journal of the Acoustical Society of America, 67, 276–287.
Pind, J. (1986). The perception of quantity in Icelandic. Phonetica, 43, 116–139.
Pind, J. (1995). Speaking rate, VOT and quantity: The search for higher-order invariants for two Icelandic speech cues. Perception & Psychophysics, 57, 291–304.
Pind, J. (1996). Spectral factors in the perception of vowel quantity in Icelandic. Scandinavian Journal of Psychology, 37, 121–131.
Pind, J. (1998). Auditory and linguistic factors in the perception of voice offset time as a cue for preaspiration. Journal of the Acoustical Society of America, 103, 2117–2127.
Pind, J. (1999). Speech segment durations and quantity in Icelandic. Journal of the Acoustical Society of America, 106, 1045–1053.
Summerfield, Q. (1981). On articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7, 1074–1095.
Emotion discrimination with increasing time windows
in spoken Finnish
Eero Väyrynen¹, Juhani Toivanen² and Tapio Seppänen³
¹,³ University of Oulu
² University of Oulu and Academy of Finland
Abstract
A study of the automatic discrimination of emo-
tion in three different time windows of speech is
presented. A large corpus of acted emotional
speech in Finnish was collected in five emo-
tional states: neutral, sadness, joy, anger and
tenderness. Automatic emotion discrimination
tests were performed in speaker-independent
scenarios using kNN classifiers with sequential
forwards-backwards floating search feature
selection algorithm. Three time windows were
used in the stimuli: vowel, utterance and mono-
logue. Human emotion discrimination tests, i.e.
listening tests, were performed to obtain base-
line data. The results indicate, firstly, that the
average emotion discrimination performance
levels did not differ at all for the human listen-
ers vs. the computer at the vowel level. Howev-
er, for the longer speech units, the human lis-
teners did better in the discrimination tasks,
with a very clear difference at the monologue
level. It can be concluded that, in emotion dis-
crimination, the human listener can utilize
more prosodic cues than the computer when the
speech unit increases (durationally and syntac-
tically).
Introduction
The human and/or automatic classifica-
tion/discrimination of emotional content from
speech has been studied for a number of lan-
guages, including minor languages such as Fin-
nish (Toivanen et al., 2004). There is now a rel-
atively large research literature on the automatic
discrimination of emotion. The common ap-
proach has been to adopt the multiple classifi-
cation task in the discrimination experiments;
that is, it is assumed that the emotional content
of the speech data can be compartmentalized
into a number of basic categories, such as an-
gry, happy, sad, bored, tender, etc.
The aim of the research outlined in this pa-
per is to investigate the performance levels of
automatic emotion discrimination for spoken
Finnish, and to complement the results with
baseline data obtained from human listening
tests. A set of basic emotions (portrayed by ac-
tors) was used, a focus being on the effect of
the duration, and the syntactic complexity, of
the speech unit on the discrimination level. A
major hypothesis is that intonational features
will serve as emotional cues for the human lis-
teners in speech units of increasing duration.
MediaTeam Speech Corpus
The speech data used in this study is an exten-
sion of the MediaTeam Speech Corpus, which
has been used in our previous research (Toiva-
nen et al., 2004). This new data includes a total
of 450 new simulated emotional speech sam-
ples in five emotional contexts, all repeated 10
times by 9 professional actors. The samples
were recorded in collaboration with the Helsin-
ki University of Technology (HUT) in an ane-
choic room at the University of Oulu.
The speech data consisted of multiple repe-
titions of the following emotions in Finnish
speech: neutral, sadness, joy, anger and
tenderness. The data were produced by pro-
fessional actors: nine actors (five men and four
women, aged between 26 and 45). None of the
subjects had any known pathologies of the la-
rynx or hearing. The subjects read out a passage
of some 80 words from a Finnish novel admit-
ting several emotional interpretations but not
containing, semantically, any obvious emotion-
al content; the average duration of the read pas-
sage (in the neutral state) was approximately
one minute. Each speaker produced, in a ran-
dom order, ten renditions of each emotional
state (however, an emotional state was never
repeated in the next rendition). The subjects
were not given any detailed instructions con-
cerning the emotional expressions (e.g. con-
cerning the distinction between cold anger and
hot anger); the subjects were given a full inter-
pretative freedom as to how the emotions
should be portrayed. There were a total of 450
emotional speech samples (ten samples for five
emotions by nine speakers).
The recording was carried out using a cali-
brated (calibration signal provided by Bruel &
Kjaer 4231) high quality condenser microphone
(Bruel & Kjaer 4188) and a DAT recorder (So-
ny DTC-690). The microphone preamp range
was set to 20-100dB, the maximum signal level
corresponding to 99dB. An additional -35dB
attenuated signal was also recorded with a sec-
ondary DAT recorder (Sony TCD-D8) to ensure
unclipped recording even in the case of the
most intense emotions. To enable accurate in-
tensity measurements, the calibration signal
was also recorded at the beginning of each tape.
The microphone was also positioned in a way
that the distance from the subject's lips would
remain 50 cm for the duration of the recording.
Recording of each sample was performed in the
same predefined semi-random order, in which
no emotion was repeated successively. Each
subject was standing while performing, and a 5-
minute break was held after every 10 rendi-
tions, with water available for the subject to
keep his or her voice quality as constant as
possible for the duration of the recording ses-
sion. All recordings were done with a 16-bit
resolution and a 48-kHz sampling rate.
The recorded data were segmented in such a
way that, for each rendition, three samples of
different durations were formed. The first time
window was a unit consisting of a vowel ex-
tracted from the running Finnish speech: the
first time window embraced [a:] from the word
taakkahan (indeed a burden) in a passage (the
suffix -han meaning indeed). The utterance
context for [a:] was: Taakkahan se vain on (It
is indeed a burden only), which formed the
second time window. The third time window
was the whole monologue containing the carrier
sentence and thus also the vowel (ca 1 minute
in duration).
Acoustic analysis
For the speech data, features were calcu-
lated using the f0Tool software. The f0Tool is a
software package for automatic prosodic analy-
sis of large quanta of speech data. The analysis
algorithm first distinguishes between the voiced
and voiceless parts of the speech signal using a
cepstrum based voicing detection logic (Ahma-
di & Spanias 1999) and then determines the f0
contour for the voiced parts of the signal with a
high precision time domain pitch detection al-
gorithm (Titze & Haixiang 1993). From the
speech signal, over forty acoustic/prosodic pa-
rameters were computed automatically.
The parameters were:
A) general f0 features: mean, 1%, 5%,
50%, 95%, and 99% values of f0
(Hz), 1%- 99% and 5%-95% f0
ranges (Hz)
B) features describing the dynamics of
f0 variation: average continuous f0
rise and fall (Hz), average f0 rise
and fall steepness (Hz/cycle), max
continuous f0 rise and fall (Hz), max
steepness of f0 rise and fall
(Hz/cycle)
C) additional f0 features: normalised
segment f0 distribution width varia-
tion, f0 variance, trend corrected
mean proportional random f0 per-
turbation (jitter)
D) general intensity features: mean,
median, min, and max RMS intensi-
ties, 5% and 95% values of RMS in-
tensity, min-max and 5%-95% RMS
intensity ranges
E) additional intensity features: normal-
ised segment intensity distribution
width variation, RMS intensity vari-
ance, mean proportional random in-
tensity perturbation (shimmer)
F) durational features: average lengths
of voiced segments, unvoiced seg-
ments shorter than 300ms, silence
segments shorter than 250ms, un-
voiced segments longer than 300ms
(not used for sentence length sam-
ples), and silence segments longer
than 250ms (not used for sentence
length samples), max lengths of
voiced, unvoiced, and silence seg-
ments
G) distribution and ratio features: per-
centages of unvoiced segments
shorter than 50ms, between 50-
250ms, and between 250-700ms, ra-
tio of speech to long unvoiced seg-
ments (speech = voiced + un-
voiced<300ms), ratio of voiced to
unvoiced segments, ratio of silence
to speech (speech = voiced + un-
voiced<300ms)
H) spectral features: proportions of low
frequency energy under 500 Hz and
under 1000 Hz
For the utterance-length analysis, two fea-
tures describing long pause duration parameters
could not be used as no such pauses existed due
to the short duration of the utterance; see sec-
tion F above. For the vowel, the durational and
distribution/ratio features were not available.
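To make the feature definitions concrete, the sketch below computes a handful of the group A and D statistics from a pre-computed f0 contour and RMS intensity track. It is only an illustration; f0Tool itself is not documented here, and the function and variable names are my own assumptions.

```python
import numpy as np

def basic_f0_intensity_features(f0_hz, rms):
    """A few of the listed statistics, computed over voiced frames.

    f0_hz: f0 values (Hz) for voiced frames; rms: frame RMS intensities.
    """
    feats = {"f0_mean": float(np.mean(f0_hz))}
    for q in (1, 5, 50, 95, 99):
        feats[f"f0_p{q:02d}"] = float(np.percentile(f0_hz, q))
    feats["f0_range_1_99"] = feats["f0_p99"] - feats["f0_p01"]
    feats["f0_range_5_95"] = feats["f0_p95"] - feats["f0_p05"]
    feats["rms_mean"] = float(np.mean(rms))
    feats["rms_median"] = float(np.median(rms))
    feats["rms_min"] = float(np.min(rms))
    feats["rms_max"] = float(np.max(rms))
    feats["rms_range_min_max"] = feats["rms_max"] - feats["rms_min"]
    return feats
```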
Emotion discrimination proce-
dure: automatic classification and
listening tests
All computer evaluations of emotion discrimi-
nation performance with prosodic feature anal-
ysis were carried out using a kNN classifier in
conjunction with a forwards-backwards float-
ing-search feature selection algorithm (Pudil et
al. 1994). The criterion for optimality was set to
be the average classification accuracy using k
values of 1, 3, 5 and 7, with the maximum
search dimension of 15 features, and 5 classes
(neutral, sadness, joy, anger, and ten-
derness). All classifiers were tested using a 9-
fold (number of persons used to produce the
speech samples) cross validation. In this proce-
dure, each person's samples are tested in turn against all the other persons' samples. The re-
sulting classification performances are then av-
eraged to produce the final estimate of classifi-
cation performance. All tests were thus con-
ducted using a person-independent setup,
where, for each sample to be tested, no other
samples of that person were ever included in
the training database.
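A rough equivalent of this evaluation setup can be written with scikit-learn, as sketched below. Note that scikit-learn's SequentialFeatureSelector performs plain forward (or backward) selection, not the floating search used in the paper, so this only approximates the described setup; all parameter values and names are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def speaker_independent_accuracy(X, y, speaker_ids, k=5, n_features=15):
    """Mean accuracy when each speaker's samples are held out in turn."""
    selector = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=k),
                                         n_features_to_select=n_features,
                                         direction="forward")
    model = make_pipeline(StandardScaler(), selector,
                          KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, groups=speaker_ids, cv=LeaveOneGroupOut())
    return float(np.mean(scores))
```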
For a comparison with the automatic me-
thod results, human listening experiments were
conducted. Tests were carried out as forced-choice
tests with the given five emotions, not contain-
ing any distracters. The listeners (15) were all
female students of nursing in their twenties.
The listening experiments were conducted over
a period of half a semester, starting with vowel
listening and ending with passage-length listen-
ing (3x450=1350 samples in total). A laptop
computer and a pair of high quality speakers
were used in playing out all the samples for the
audience. The same location, a class-room in a
very quiet environment, was used throughout
the listening tests. Samples were played in a
random order in each speech category ((a) vo-
wel, (b) utterance, and (c) monologue). As each
listening session consisted of 30 samples, with
a five-second break between the samples, the
average duration for the session was 45 minutes
for the monologue-level tests. Needless to say,
the listening tests, with a total of 45 separate
sessions, were something of an endurance test
for the subjects.
Results
The results are reported as confusion matrices
chosen from parts of feature search curves
where the performance has reached a plateau
level. It is assumed that real performance in-
crease is not gained by adding more features
past the plateau point, due to the risk of over
fitting and increasing the requirements on the
training data (curse of dimensionality). In re-
porting the results, the choice of k = 5 was se-
lected as a reasonable tradeoff between model
bias and variance that still produces acceptable
results. The feature vectors used for the selected
classifier setups are presented. In the confusion
matrices below, the emotions are indicated as
follows: 1-neutral, 2-sadness, 3-joy, 4-
anger, and 5-tenderness. Tables 1-3 report
the discrimination levels achieved by the com-
puter and Tables 4-6 report those for the human
listeners in percentages.
Table 1. Emotion discrimination / (a) vowel; com-
puter: 43.0%.
1 2 3 4 5
1 36.0 12.4 25.8 15.7 10.1
2 13.5 32.6 16.9 7.9 29.2
3 23.3 10.0 46.7 8.9 11.1
4 22.2 7.8 7.8 56.7 5.6
5 14.6 24.7 13.5 4.5 42.7
Table 2. Emotion discrimination / (b) utterance;
computer: 43.8%.
1 2 3 4 5
1 33.3 28.9 11.1 11.1 15.6
2 23.3 32.2 4.4 11.1 28.9
3 18.9 7.8 51.1 11.1 11.1
4 21.1 5.6 18.9 45.6 8.9
5 12.2 23.3 3.3 4.4 56.7
Table 3. Emotion discrimination / (c) monologue;
computer: 54.9%
1 2 3 4 5
1 60.0 10.0 22.2 0.0 7.8
2 15.6 56.7 22.2 1.1 4.4
3 10.0 3.3 68.9 12.2 5.6
4 8.9 12.2 30.0 45.6 3.3
5 21.1 18.9 13.3 3.3 43.4
Table 4. Emotion discrimination / vowel; human
listeners: 42.4%.
1 2 3 4 5
1 46.3 18.1 15.3 9.9 10.4
2 19.5 40.7 9.7 5.0 25.0
3 21.9 16.8 34.7 20.6 6.1
4 23.3 7.0 9.4 58.7 2.7
5 20.9 29.7 16.3 1.5 31.7
Table 5. Emotion discrimination / utterance; human
listeners: 56.3%.
1 2 3 4 5
1 67.3 14.7 3.5 8.5 6.1
2 17.9 59.3 3.8 3.4 15.7
3 21.4 7.9 45.8 18.4 6.4
4 22.3 6.3 3.5 65.8 2.1
5 20.3 26.4 9.2 0.8 43.3
Table 6. Emotion discrimination / monologue; hu-
man listeners: 71.1%.
1 2 3 4 5
1 81.4 4.2 3.4 7.4 3.6
2 11.9 72.0 0.9 3.3 11.9
3 10.5 2.5 74.8 7.3 4.9
4 13.3 6.4 3.8 75.3 1.1
5 15.7 14.4 16.9 0.8 52.2
In the automatic discrimination, the feature
vectors were the following:
a) mean f0, proportion of low fre-
quency energy under 500 Hz,
and mean RMS intensity (for
vowel)
b) mean RMS intensity, max RMS
intensity, 95% value of RMS in-
tensity, 5% value of RMS inten-
sity, normalised segment inten-
sity distribution width variation,
and normalized segment f0 dis-
tribution width variation (for ut-
terance)
c) jitter, max steepness of f0 rise,
max steepness of f0 fall, median
RMS intensity, 95% value of
RMS intensity, average length of
silence segments longer than
250ms, ratio of silence to
speech, and normalised segment
f0 distribution width variation
(for monologue).
Discussion and conclusion
It has been established that, for spoken Finnish,
automatic emotion discrimination can reach le-
vels that are close to human emotion discrimi-
nation performance (Toivanen et al. 2004). The
results of the present investigation would seem
to suggest that, under optimal circumstances
(for a vowel-length speech unit), the computer
can produce discrimination results comparable
to those of a group of human listeners (43.0 % vs.
42.4 %). However, when the task is to discri-
minate between emotions in longer speech units
(monologues), the human listener is far superior
(71.1 % vs. 54.9 %).
The f0-related parameters signaling emo-
tional content in speech can effectively and ex-
tensively manifest themselves only if the
speech unit is long enough (above the utter-
ance/sentence level). Only then can the poten-
tial of intonation patterns as such be used to
convey affect. It is clear that this possibility ex-
ists in spoken Finnish, too: longer speech units
over and above word-level and sentence-level
prosody contain essential prosodic information
on affect. These findings offer at least indirect
evidence in favor of the argument that intona-
tional long-term features (spanning over utter-
ances) signal affect in Finnish, as they do in
other languages; cf. e.g. O'Connor & Arnold
(1973). The f0 features analyzed and utilized by
the computer (e.g. max steepness of f0 rise,
max steepness of f0 fall) obviously correlate
with the global prosodic structure of longer
speech units, but they do not directly reflect in-
tonation as such (e.g. utterance-final intonation
contours). The human listener essentially hears
intonation, whenever possible, and analyzes
emotional contrasts on the basis of such pho-
nologically structured prosodic information.
References
Ahmadi S. and Spanias A.S. (1999) Cepstrum-
based pitch detection using a new statistical
V/UV classification algorithm, IEEE Trans-
action on Speech and Audio Processing.
Vol. 7, NO. 3, 333-338.
O'Connor J.D. and Arnold G.F. (1973) Intona-
tion of Colloquial English. London: Long-
man.
Pudil P., Novovičová J. and Kittler J. (1994)
Floating search methods in feature selec-
tion. Pattern Recognition Letters 15 (11),
1119-1125.
Titze I.R. and Haixiang L. (1993) Comparison
of F0 extraction methods for high-precision
voice perturbation measurements. Journal of
Speech and Hearing Research, Vol. 36,
1120-1133.
Toivanen J., Väyrynen E. and Seppänen T.
(2004) Automatic discrimination of emotion
from spoken Finnish. Language and Speech
47, 383-412.

Looking at tongues – can it help in speech perception?
Preben Wik, Olov Engwall

Centre for Speech technology, School of Computer Science and Communication, KTH, Sweden

Abstract
This paper describes the contribution to speech
perception given by animations of intra-oral
articulations. 18 subjects were asked to identify
the words in acoustically degraded sentences in
three different presentation modes: acoustic
signal only, audiovisual with a front view of a
synthetic face and an audiovisual with both
front face view and a side view, where tongue
movements were visible by making parts of the
cheek transparent. The augmented reality side-
view did not help subjects perform better over-
all than with the front view only, but it seems to
have been beneficial for the perception of pala-
tal plosives, liquids and rhotics, especially in
clusters.
Introduction
It is well established that visual information
support speech perception, especially if the
acoustic signal is degraded (Sumby and Pollack
1954). Not only hearing-impaired listeners but
also normal-hearing listeners benefit from in-
formation given by the face, and it has been
shown, in e.g. Agelfors et al (1998), that this
gain is not only provided by a natural face, but
also by synthetic faces.
Many phonemes are however impossible to
identify by looking at the speakers face, since
the articulation of the tongue cannot be seen
when the place of articulation is too far back.
Would it be beneficial to supplement the acous-
tic signal and speech reading of the face with
additional visualization of tongue movements?
An application has been developed in a joint
showcase by KTH and LORIA, Nancy, France,
within the European Network of Excellence
MUSCLE, to investigate the potential benefit
of such an augmented reality display with two
groups in mind.
1) A community of hearing-impaired persons
that rely on cued speech (where additional phonetic information is conveyed with hand sign gestures) (Cornett and Daisey, 1992).
2) Second language learners that may find it
difficult to perceive or produce phonetic con-
trasts that do not exist in the mother tongue.

Since we are normally unaccustomed to seeing
the movements of the intra-oral articulators, it
remains an open question if such information
may be efficiently employed by listeners.
Two recent experiments (Tarabalka et al.
2007; Grauwinkel et al. 2007) investigated if
consonant identification in CVC words could
be enhanced by animations of tongue move-
ments when the speech signal was noisy. The
studies showed that the display of tongue
movements did improve consonant perception,
but only if the subjects were first allowed to
grow accustomed to the new type of visual in-
formation.
The implications of the two studies for gen-
eral speech perception are nevertheless limited,
since only forced-choice identification of con-
sonants was tested. If the articulatory display is
to be used as a speech perception support, we
need to investigate if the intra-oral visualization
can improve perception of a more complex
content. In this study we therefore test a talking
head with intra-oral animations as a support for
word recognition in sentences.
The augmented visualization head
The MUSCLE visualization display consists of
a double view of the face, from the front and
the side, with transparent cheek, as shown in
Fig. 1. The face, jaw and tongue models are
based on 3D-wireframe meshes that are de-
formed by parametric weighted transformations
(Beskow 1997). The tongue model is based on
a statistical analysis of Magnetic Resonance
Images of a Swedish subject producing vowels
and consonants in three symmetric vowel-
consonant-vowel (VCV) contexts (Engwall
2003). Synthesized face and tongue movements
can be created from a string of phonetic charac-
ters as input, using a rule-based audiovisual
synthesizer. Interpolated parameter trajectories
are created from the phoneme strings, taking
visual coarticulation into account (Cohen and
Massaro 1993). For the tongue movements, the
coarticulation and timing is modelled on Elec-
tromagnetic Articulography data (Engwall
2003).
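In the Cohen and Massaro (1993) scheme, each segment contributes a parameter target weighted by an exponentially decaying dominance function centered on the segment, and the trajectory is the normalized weighted sum of the targets. The sketch below is a simplified illustration of that idea, not the synthesizer's actual code; all constants are placeholders.

```python
import numpy as np

def dominance(t_ms, center_ms, alpha=1.0, theta=0.05, c=1.0):
    """Exponentially decaying dominance of one segment around its centre."""
    return alpha * np.exp(-theta * np.abs(t_ms - center_ms) ** c)

def blended_trajectory(t_ms, segments):
    """Parameter trajectory as a dominance-weighted average of segment targets.

    segments: list of (target_value, center_ms) pairs for successive phonemes.
    """
    num = np.zeros_like(t_ms, dtype=float)
    den = np.zeros_like(t_ms, dtype=float)
    for target, center in segments:
        d = dominance(t_ms, center)
        num += d * target
        den += d
    return num / den

t = np.linspace(0, 300, 301)                          # one value per millisecond
trajectory = blended_trajectory(t, [(0.2, 50), (0.8, 150), (0.3, 250)])
```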
The MUSCLE visualization head takes an
acoustic utterance as input, generates the corresponding articulatory movements in the syn-
thetic face and presents the animations syn-
chronized with the acoustic signal. The articu-
latory movements are synthesized based on
phoneme recognition of the spoken utterance.
In the perception test reported here, speech
recognition is not used. Instead, the input to the
visual speech synthesis is force-aligned label
files of the pre-recorded utterances.
Figure 1. Visual interface in the speech perception
test. Three conditions were tested: (AO) Neither of
the two faces shown; (AF) Front view of the face
shown; (AFT) both front and side view shown
Stimuli and Subjects
The stimuli consisted of 60 short Swedish sen-
tences spoken by a male Swedish speaker. The
sentences have a simple structure (subject,
predicate, object) and "everyday content", such
as "Kappan hnger i garderoben" (The coat
hangs in the wardrobe) or "Laget frlorade
matchen" (The team lost the game). These sen-
tences are part of a set of 270 sentences de-
signed for audiovisual speech perception tests
by hngren, based on MacLeod and Summer-
field (1990).
The sentences were presented in three dif-
ferent conditions: Acoustic Only (AO), Audio-
visual with Face (AF), Audiovisual with Face
and Tongue (AFT). For all conditions the
acoustic signal was degraded and the audio-
only condition provides a baseline intelligibility
level. Two levels of audio degradation were
used, to study the benefit of the visual informa-
tion at two different simulated levels of hearing
loss. A noise-excited channel vocoder with 2 or
3 frequency channels was used to reduce the
spectral details and create an amplitude modu-
lated and bandpass filtered speech signal con-
sisting of multiple contiguous channels of white
noise over a specified frequency range (Sicili-
ano et al 2003). The test was set up so that the
30 first stimuli were presented with three fre-
quency channels and the 30 last with only two.
The difficulty of the task was hence increased
halfway into the test for all subjects.
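For readers unfamiliar with this kind of masker, a rough noise-excited channel vocoder can be sketched as follows: band-pass the speech into N contiguous channels, take each band's amplitude envelope, and use it to modulate band-limited white noise. This is only an illustration with assumed filter settings, not the vocoder of Siciliano et al. (2003):

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def noise_vocode(x, fs, n_channels=3, f_lo=100.0, f_hi=5000.0):
    """Rough N-channel noise-excited vocoder (illustrative only)."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)   # contiguous channel edges
    noise = np.random.default_rng(0).standard_normal(len(x))
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envelope = np.abs(hilbert(band))               # amplitude envelope of the band
        out += envelope * sosfiltfilt(sos, noise)      # modulate band-limited noise
    return out / (np.max(np.abs(out)) + 1e-12)
```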
18 normal-hearing subjects participated in
the experiment. All were current or former uni-
versity students and staff. They were divided
into three groups, where the only difference be-
tween the three groups was that the sentences
were presented in different conditions to differ-
ent groups. This was made so that every sen-
tence was presented in all three conditions. The
sentence order was random, but the same for all
subjects.
Experimental set-up
The graphical interface for the perception test,
shown in Fig. 1, consisted of an upper display
with the animations of the face and a lower part
where the test subjects could type in their an-
swers. The task was to identify as many words
as possible in each sentence. The subjects were
allowed to repeat the stimuli as many times as
they wished before giving their answer. Repeti-
tions were allowed since the sentence test mate-
rial is quite complex and involves rapid tongue
movements.
The acoustic signal was presented over
headphones and the graphical interface was dis-
played on a 15" laptop computer screen. The
perception experiment started with a familiari-
zation set of sentences in AFT condition. The
subjects were instructed to prepare for the test
by listening to a set of five vocoded and five
clear sentences accompanied by the double
view of the synthetic face. Each familiarization
sentence could be repeated as many times as
the subject wanted. The correct answer could
then be displayed upon request from the subject
in the familiarization phase (no feedback was
given on the subjects' answers during the test).
When the subjects felt prepared for the actual
test, they started it themselves. The entire ex-
periment, including familiarization and test,
lasted 30-40 minutes.
Data analysis
The subjects' written replies were saved in
XML format and then analyzed manually. For
each stimulus sentence, the presentation condition (AO, AF, AFT), the number of times the stimulus was played and the subject's answer was
stored together with the correct sentence text.
The word accuracy was then counted disregard-
ing morphologic errors.
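A minimal sketch of such scoring is given below; the stemming function is merely a crude stand-in for the manual policy of disregarding morphologic errors, and everything here is an assumption rather than the actual analysis script:

```python
def word_accuracy(reference, response, stem=lambda w: w.lower()[:4]):
    """Fraction of reference words found in the typed response.

    The crude stem() (keep the first four letters) only approximates the idea
    of ignoring morphologic differences between response and reference.
    """
    ref = [stem(w) for w in reference.split()]
    resp = {stem(w) for w in response.split()}
    return sum(1 for w in ref if w in resp) / len(ref) if ref else 0.0

print(word_accuracy("Laget förlorade matchen", "laget förlora matcher"))  # 1.0
```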

The analysis focused on relating the word
accuracy scores especially to the factors: pres-
entation condition (Did visual cues improve the
word recognition score?), stimuli sentence
(Were some sentences better recognized in one
condition? If so, why?), acoustic degradation
(Was there any difference in visual contribution
between the two levels of acoustic informa-
tion?) and number of repetitions (Did addi-
tional visual information require more repeti-
tions?).
Results
Fig. 2 shows the overall scores for the three dif-
ferent conditions, averaged over subjects in the
tree different groups. The results for the two
audiovisual conditions were better than the
acoustic only for both levels of audio degrada-
tion. A two-sided t-test showed that the differ-
ences were significant at a level of p<0.05 for
three channels and p<0.0005 for two channels.
The performance on the two audiovisual condi-
tions was almost identical: AF (standard devia-
tion SD=19 for three channels 3C, SD=20 for
two channels 2C) and AFT (3C: SD=15, 2C:
SD=14). Overall, the augmented reality display
of the tongue movements did not improve the
performance. Fig. 2 however shows that the
performance differed substantially between the
groups, with higher accuracy in AFT condition
than in AF for the three channels signal for
groups 1 and 2, but lower for group 3. There
were further qualitative differences between the
two- and three-channel conditions.
Figure 2. Mean word identification scores per group, and for the three conditions AO: Acoustic Only, AF: Audiovisual with Face, AFT: Audiovisual with Face and Tongue. The filled bars refer to three-channel vocoded speech, and the striped to two channels (more acoustically degraded).
This suggests that the phonetic content or
semantic complexity varied between the differ-
ent sentences. The mean performance on each
sentence in the different conditions was there-
fore analyzed. Fig. 3 shows the difference in
word accuracy rate between the two audiovis-
ual conditions and the acoustic only (bars in the
positive range hence indicating a better per-
formance in AF and AFT compared to AO).
From Fig. 3, one can identify the sentences for
which AFT was much better than AF (sen-
tences 9, 10, 17, 21, 22, 28, 30, 35, 37, 42, 56)
and vice versa (1-3, 6, 12, 27, 33, 52, 57).
For all but two (17 and 42) of the sentences
that were more intelligible with the AFT dis-
play than the AF, the difference in word accu-
racy score is between groups 1 and 2. Since the
overall AFT score for these two groups were
better than the AF score, the differences may be
attributed to the phoneme sequences in the test
sentences.
An analysis of the sentences that were better
perceived in AFT than in AF condition indi-
cates that subjects found additional information
in AFT mainly for palatals [k, g], the liquid [l]
and the rhotic [r]. In particular consonant clus-
ters with palatal plosives and liquids, [kl, rk],
but also other clusters with [l, r], such as [dr, tr,
lj, ml, pl, pt], were better perceived with anima-
tions of the tongue. The effect was not univer-
sal, for all occurrences of the clusters or all
subjects. The results nevertheless suggest that
subjects were able to extract information from
the animation of the raising of the tongue tip
(for [r, l]) or tongue dorsum (for [k, g]), that
may be difficult to perceive from a front face
view.
The sentences that were better perceived in
the AF condition contained more bilabials and
labiodentals than the average sentences, indi-
cating that the AFT improvement comes at a
price: Subjects may miss information that is
clearly visible in the AF view when concentrat-
ing on the tongue.
Discussion and Conclusions
Overall the tongue movement display did not
give any additional support in sentence percep-
tion, compared to a front view of the face. This
is not a surprising finding, since the animated
tongue movements are unfamiliar, whereas
speech reading of a synthetic face can build on
human face-to-face communication. Some test
sentences that contained phonemes with attrib-
utes that are invisible in the face, such as pala-
tal plosives and/or liquids/rhotics, were better
perceived when the subjects were presented
with tongue animations. It thus appears that
subjects are in fact able to extract some infor-
mation about phonemes from the intra-oral ar-
ticulation, even in a more complex speech ma-
terial consisting of sentences. Some subjects
did however clearly benefit from the AFT view,
with the best subjects having 30% better word
recognition compared to the AO or AF cases.
Inter-subject variability was however very
high, and another subject from the same group
scored 48% lower in AFT than in AF for the
same sentences (two-channels).
It seems unrealistic that articulatory infor-
mation can be used as an alternative to cued
speech for real-time speech perception without
large amounts of training, due to the rapidity of
the tongue movements. It is easier to envisage
that intra-oral articulation displays can be bene-
ficial in computer-assisted pronunciation and
perception training applications, (Engwall et al
2006) where the user can repeat the animations
the desired number of times or even play them
in slow motion. In such an application, the ad-
ditional articulatory information conveyed by
the intra-oral animation may support the user in
establishing the articulatory-acoustic relation-
ship for the foreign phonemes.
Acknowledgements
The visual display used in the perception test
was partially developed in the Network of Ex-
cellence MUSCLE (Multimedia Understanding
through Semantics, Computation and Learn-
ing), funded by the European Commission. The
research was also supported by the Graduate
School of Language Technology (GSLT).
References
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Synthetic faces as a
lipreading support. Proceedings of ICSLP.
Beskow, J. (1997). Animation of Talking
Agents, Proceedings of AVSP'97, 149-152.
Cornett, O. & Daisey, M. E. (1992). The Cued
Speech Resource Book for Parents of Deaf
Children. National Cued Speech Ass.
Figure 3. The mean contribution for each sentence of the synthetic face in the two audiovisual conditions in relation to the performance on the audio-only condition (the line shows mean AO accuracy). Stimuli 1-30 have vocoded speech with 3 channels, stimuli 31-60 with 2.
Cohen, M. and Massaro, D. (1993). Modeling coarticulation in synthetic visual speech. In N. Magnenat-Thalmann & D. Thalmann (eds.), Computer Animation '93. Springer-Verlag.
Engwall, O. (2003). Combining MRI, EMA &
EPG in a three-dimensional tongue model,
Speech Communication, vol. 41/2-3, 303-
329.
Engwall, O., Bälter, O., Öster, A-M., and Kjellström, H. (2006). Designing the user inter-
face of the computer-based speech training
system ARTUR based on early user tests.
Journal of Behavioural and Information
Technology, 25(4), 353-365.
Grauwinkel, K., Dewitt, B. and Fagel, S.
(2007). Visual Information and Redundancy
Conveyed by Internal Articulator Dynamics
in Synthetic Audiovisual Speech. Proceed-
ings of Interspeech 2007, 706-709.
MacLeod, A., and Summerfield, Q. (1990). A procedure for measuring auditory and audiovisual speech-reception thresholds for sentences in noise: Rationale, evaluation and recommendations for use. British Journal of Audiology, 24, 29-43.
Siciliano, C., Williams, G., Beskow, J., & Faulkner, A. (2003). Evaluation of a Multilingual Synthetic Talking Face as a Communication Aid for the Hearing Impaired. In Proc of Intl Conf of Phonetic Sciences, 131-134.
Sumby, W.H. and Pollack, I. (1954). Visual
Contribution to Speech Intelligibility in
Noise. Journal of the Acoustical Society of
America, 26, 212-215.
Tarabalka, Y., Badin, P., Elisei, F. and Bailly,
G. (2007). Can you "read tongue move-
ments"? Evaluation of the contribution of
tongue display to speech understanding.
Proceedings of ASSISTH2007, 187-193.

Human Recognition of Swedish Dialects
Jonas Beskow², Gösta Bruce¹, Laura Enflo², Björn Granström², Susanne Schötz¹ (alphabetical order)
¹ Dept. of Linguistics & Phonetics, Centre for Languages & Literature, Lund University, Sweden
² Dept. of Speech, Music & Hearing, School of Computer Science & Communication, KTH, Sweden

Abstract
Our recent work within the research project
SIMULEKT (Simulating Intonational Varie-
ties of Swedish) involves a pilot perception
test, used for detecting tendencies in human
clustering of Swedish dialects. 30 Swedish
listeners were asked to identify the geo-
graphical origin of 72 Swedish native speak-
ers by clicking on a map of Sweden. Results
indicate for example that listeners from the
south of Sweden are generally better at rec-
ognizing some major Swedish dialects than
listeners from the central part of Sweden.
Background
This experiment has been carried out within
the research project SIMULEKT (Simulating
Intonational Varieties of Swedish) (Bruce,
Granström & Schötz, 2007). Our object of
study is the prosodic variation characteristic
of seven different regions of the Swedish-
speaking area: South, Göta, Svea, with Dala
as a distinct subgroup, Gotland, North, and
Finland Swedish. The seven regions corre-
spond to our present dialect classification
scheme.
Speech material
One of our main sources for analysis is the
Swedish speech database SpeechDat (Elenius,
1999). SpeechDat contains speech recorded
over the telephone from 5000 speakers, regis-
tered by age, gender, current location and
self-labeled dialect type, according to Elert's
suggested Swedish dialect groups (Elert,
1994) that is a more fine-grained classifica-
tion with 18 regions in Sweden.
Introduction
Prosody, vowels and some consonant allo-
phones are likely to be important when trying
to decide from where a person originates. The
aim of this work is to develop a method
which could be of help in finding out how
well Swedish subjects can identify the geo-
graphical origin of other Swedish native
speakers. By determining the dialect identifi-
cation ability of Swedish listeners, a founda-
tion could be laid for further research in-
volving dialectal clusters of speech. In order
to evaluate the importance of the factors
stated above for dialect recognition, a pilot
test was put together using recordings of iden-
tical utterances from 72 speakers.
In the Swedish SpeechDat database, two
sentences read by all speakers were added for
their prosodically interesting properties. One
of them was used in this experiment: Mobiltelefonen är nittiotalets stora fluga, både bland företagare och privatpersoner 'The mobile phone is the big hit of the nineties, both among business people and private persons'.
For this test, each of Elert's 18 dialect groups in Sweden was represented by four
speakers, two female and two male, with an
age span as wide as possible.
Subjects
30 subjects participated in the experiment, 12
female and 18 male, with an average age of
32 and 33 years, respectively. Subjects were
placed in two groups depending on where the majority of their childhood and adolescence (0-18 years) had been spent. Seven female and eleven male subjects grew up in the central part (Svealand), whereas five female and seven male subjects were raised in the southern part (Götaland) of Sweden.
Experiment
The test was implemented in the scripting language Tcl/Tk and carried out in Stockholm and
Lund. The experiment comprised a dialect-
test part, a geography test and a questionnaire.
In the dialect test, the SpeechDat stimuli
were played in random order over headphones
and could be repeated as many times as de-
sired before answering by clicking on a map
of Sweden.
The geography test included 18 Swedish
towns presented one by one in written form,
which were placed on the map in the same
manner as for the dialect test. These towns are
the most populated in each of Elert's dialect
group areas.
Lastly, a questionnaire was filled out by
all subjects, so as to provide information
about e.g. age, gender and dialectal back-
ground.
Results
Subjects vary considerably in their ability to locate speakers. In Figure 1, results for two lis-
teners are displayed on the Swedish map.
Dark dots mark the correct dialect locations
and light dots the answers provided by the
subjects. Figure 2 displays the results from
the geography test in the same way. The two
subjects were chosen as typical representa-
tives of Svealand and Götaland. Both were males aged 25, but with different backgrounds. Subject 1 from Svealand was born and raised in Stockholm with parents from Stockholm and had been exposed to regional accents only to a small extent. Subject 2 from Götaland was born and raised in Jönköping by parents from the same area.
Figure 1. Dialect test results for subject 1 from Svealand (left) and subject 2 from Götaland (right). Dark dots for correct locations are connected by lines with light dots for the answers given by the subject.
Figure 2. Geography test results for subject 1 from Svealand (left) and subject 2 from Götaland (right). Dark dots for correct locations are connected by lines with light dots for the answers given by the subject.
Figure 3. Dialect test results for speaker no. 1 from Svealand (left) and speaker no. 2 from Norrland (right). The dark dot for the correct location is connected by lines with light dots for the answers given by all subjects.

Speakers in the test vary considerably as to
how consistently they are identified. An ex-
ample is displayed in Figure 3, which shows
where all subjects have placed speaker no. 1,
a 51-year-old female from Täby, Svealand, and speaker no. 2, a 55-year-old female from
Kiruna, Norrland.
Average placement errors
The average errors in dialect placement were
computed as an arbitrary unit distance on the
map. Figure 4 shows this mean for six differ-
ent Elert dialect areas (four speakers in each
area). The subjects are divided into Svealand and Götaland listeners (see the Subjects section).
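Each answer's error was thus the straight-line distance on the map between the clicked point and the speaker's true location, averaged per listener group and dialect area. A minimal sketch of such a computation is given below; the coordinate data, field names and grouping keys are illustrative assumptions, not taken from the actual Tcl/Tk test software.

from collections import defaultdict
from math import hypot

def mean_placement_errors(responses):
    # responses: list of dicts with hypothetical fields 'listener_group',
    # 'dialect_area', 'click' (x, y) and 'true_loc' (x, y), all
    # coordinates in the same arbitrary map units.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in responses:
        dx = r['click'][0] - r['true_loc'][0]
        dy = r['click'][1] - r['true_loc'][1]
        key = (r['listener_group'], r['dialect_area'])
        sums[key] += hypot(dx, dy)   # Euclidean distance on the map
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}

Each bar in Figure 4 below would correspond to one (listener group, dialect area) entry of the returned dictionary.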

Figure 4. Götaland and Svealand listeners' average dialect location errors (distance from correct location, arbitrary unit) for speakers from six Elert dialect areas (groups 18, 14, 8, 7, 5 and 1): Skåne (far south), Göteborg, Stockholm, Gotland, Upper Dalarna and Norrland (far north).

Discussion and future work
Our data suggest that Svealand listeners are less able to locate dialects, except for their own and the accent of Dalarna, which is geographically nearby. It is probable that human listeners are better at identifying and locating dialects originating from their own dialectal area than those coming from more distant regions. However, the Götaland listeners were also good at locating Svealand speakers, possibly due to the great exposure of these dialects in the media. The high error values for the far north part of Norrland may be explained by the longer distances between towns and different-sounding dialects in that area, but also in part by the subjects' lesser exposure to northern accents. These are some examples of results from the dialect location test. Further analysis of the data, including full statistical treatment, is planned for the near future. A possible extension is to use segmentally neutralized stimuli, in order to focus on the prosodic features of Swedish regional varieties. We also wish to use listener clustering as a tool in deciding which factors play the most important roles in distinguishing the different Swedish dialect types, which might lead to a modified dialect taxonomy.
Acknowledgements
This work is supported by a grant from the
Swedish Research Council.
References
Bruce G., Granström B. and Schötz S. (2007) Simulating Intonational Varieties of Swedish. Proceedings of ICPhS XVI, Saarbrücken, Germany.
Elenius K. (1999) Two Swedish SpeechDat databases - some experiences and results. Proceedings of Eurospeech 99, 2243-2246.
Elert C.-C. (1994) Indelning och gränser inom området för den nu talade svenskan - en aktuell dialektografi. In Kulturgränser - myt eller verklighet? (Edlund, L.-E. (Ed.)). Umeå, Sweden: Diabas, 215-228.

F0 in contrastively accented words in three Finnish dialect areas
Riikka Ylitalo
Phonetics, Oulu University

Abstract
Accent in Finnish is realized mainly through an F0 rise-fall pattern. The phonetic realization of Finnish accent has so far been investigated most systematically in Northern Finnish. This study looked at accentuation also in two Western Finnish dialects. It turned out that F0 reaches a higher level in Northern Finnish accented words than in those of the Western Finnish dialects, and that in the Western dialect of Turku the timing of the F0 rise-fall pattern in the CV.CV(X) words differs from that in all other words investigated.
Introduction
Contrastive accent in Finnish is realised by an F0 rise-fall; in addition, it lengthens segment durations in certain parts of the word (Suomi, Toivanen & Ylitalo, in preparation; Suomi, Toivanen & Ylitalo 2006, 239). So far the phonetic realisation of accentuation in Finnish has been studied most systematically in Northern Finnish. In Northern Finnish F0 normally rises during the first mora of the accented word and falls mainly during its second mora. This rise-fall pattern is uniform across words whose segmental structure differs. (Suomi, Toivanen & Ylitalo 2006, 225-227.)
Method
The aim of this study is to investigate how contrastive accent is realised phonetically by speakers from three different dialect areas of Finnish: the Oulu, Turku and Tampere regions. Oulu belongs to the area of Northern Finnish, while the other cities belong to the western dialect area of Finnish; more precisely, Turku belongs to the South-West dialect area and Tampere to the Häme dialect area. In this study there were 6 informants from each dialect area; the informants were born, or have lived since early childhood, in the area they represent. The informants are all women, and they were 18-25-year-old students at the time of the recordings. The informants read the target words embedded in frame sentences from a computer screen in a studio. Their speech was recorded (44.1 kHz, 16 bit) directly to hard disk; the Tampere informants' speech was first recorded on a minidisc and later copied to hard disk. It needs to be pointed out that the speech material used in this study is definitely not dialect in the proper sense of the word, even though the word 'dialect' is used to describe the informants' backgrounds; the speakers spoke their locally coloured variants of Standard Finnish.
The material consisted of 10 words representing each of the structures CV.CV, CV.CV.CV and CV.CV.CVC.CV (for example sika, sikala, sikalasta), similarly 10 words of each of the structures CVV.CV, CVV.CV.CV and CVV.CV.CVC.CV (for example siika, Siikala, Siikalasta), 5 CVCa.CaV, 5 CVCa.CaV.CV and 5 CVCa.CaV.CVC.CV words (for example seppä, Seppälä, Seppälästä), 3 CVCaVoiceless.CbV, 3 CVCaVoiceless.CbV.CV and 3 CVCaVoiceless.CbV.CVC.CV words (for example sotka, Sotkamo, Sotkamosta) and 2 CVCaVoiced.C2V, 2 CVCaVoiced.CbV.CV and 2 CVCaVoiced.CbV.CVC.CV words (for example kanta, kantama, kantamasta). Nearly all of the C2s of the CV.CV(X) and CVV.CV(X) words and the Ca.Ca sequences of the CVCa.CaV(X) words are voiceless. Altogether there were 1620 target word tokens, 540 from each dialect area.
The target words were placed in the frame sentences in a position in which they would be contrastively accented. The informants were also asked to emphasize the capitalised words. For example, the target word koti 'home' was placed in the following frame sentence: Sanoin että Annan KOTI paloi, en sanonut että Annan KOULU paloi 'I said that Anna's HOME burnt, I didn't say that Anna's SCHOOL burnt'. The F0 values were measured with Praat at the following points: at the beginning and at the end of the syllable preceding the target word; at the beginning, the middle and the end of the first syllable of the target word, as well as at the temporal midpoints between the beginning and the middle and between the middle and the end; that is, F0 was measured at five equidistant points of the first syllable, and similarly for the second syllable. In the third syllable F0 was measured at the beginning, at the middle and at the end, and in the fourth syllable at the beginning and at the end. There were more measurement points in the first and second syllables of the words than in later syllables, because the accentual F0 curve is mostly realised during the first two syllables, and also because especially the fourth syllable is often reduced. F0 was also measured at the beginning, at the middle and at the end of the syllable following the target word, and at the highest peak in the target word. The temporal location of the highest F0 peak relative to word onset was also measured. All the target words produced by the informants were listened to before making the analyses, and mispronounced target words, as well as target words that were produced unaccented, were rejected.
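To illustrate the sampling scheme, the sketch below reads F0 at five equidistant time points within a labelled syllable from a precomputed pitch track by linear interpolation; the array-based track and the function name are illustrative assumptions, not the actual Praat analysis used in the study.

import numpy as np

def f0_at_equidistant_points(track_times, track_f0, syll_start, syll_end, n_points=5):
    # Sample F0 (Hz) at n_points equidistant times from syllable start
    # to syllable end, interpolating linearly in the (time, F0) track.
    sample_times = np.linspace(syll_start, syll_end, n_points)
    return np.interp(sample_times, track_times, track_f0)

With n_points=5 the samples fall at the beginning, the midpoint between beginning and middle, the middle, the midpoint between middle and end, and the end of the syllable, as described above.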
Results
Every measured F0 value was normalised before performing the statistical analyses by subtracting from it the F0 value at the beginning of the syllable preceding the same target word, and then adding the mean of the F0 values at the beginning of the syllables preceding all the target words. In the CV.CV(X) words the F0 of the first syllable was at all five measurement points higher in Oulu than in the other dialects, which did not differ from each other in their F0 values (point 1 [F(2,45) = 11.04, p < 0.001], point 2 [F(2,45) = 10.59, p < 0.001], point 3 [F(2,45) = 9.80, p < 0.001], point 4 [F(2,45) = 8.16, p = 0.001], point 5 [F(2,45) = 7.22, p < 0.01]). At the first three measurement points of the second syllable there were no significant F0 differences between the dialects, but at the last two measurement points of the second syllable F0 was higher in Turku than in Oulu, and Tampere's F0 values did not differ from those of the other dialects (point 4 [F(2,45) = 3.23, p < 0.05], point 5 [F(2,45) = 4.34, p < 0.05]). At the first and second measurement points of the third syllable there were no significant F0 differences between the dialects. At the third measurement point of the third syllable F0 was higher in Turku than in Oulu, and Tampere did not differ from the other dialects [F(2,30) = 4.27, p < 0.05]. In the fourth syllable the F0 values were statistically similar in all the dialects. In Oulu the F0 peak was located approximately 172 ms, in Turku 264 ms and in Tampere 207 ms from word onset. The location of the peak was statistically the same in Oulu and in Tampere, but in Turku the peak was significantly later in the word than in the other dialects [F(2,45) = 10.09, p < 0.001].
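The normalisation described at the beginning of this section can be written out as a small routine: for each target word, the F0 value at the onset of the preceding syllable is subtracted from all of that word's measurement points, and the grand mean of those onset values is added back so that the normalised values remain in a realistic Hz range. The array layout below is an illustrative assumption, not the original analysis script.

import numpy as np

def normalise_f0(f0_points, preceding_onset_f0):
    # f0_points: (n_words, n_measurement_points) array of F0 values in Hz.
    # preceding_onset_f0: (n_words,) array with the F0 at the beginning
    # of the syllable preceding each target word.
    f0_points = np.asarray(f0_points, dtype=float)
    ref = np.asarray(preceding_onset_f0, dtype=float)
    return f0_points - ref[:, None] + ref.mean()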
Figure 1. F0 in the CV.CV(X) words (the segment duration values represent the durations in the four-syllabic words). The measurement points between which additional comparisons were made (see below) are marked with rectangles.
Because the segments have different durations in different dialects, the syllable-bound measurement points are located at different absolute distances from the word onset in the different dialects. It can be seen in Figure 1, at roughly 290 ms, that in the four-syllabic CV.CV(X) words the first second-syllable Turku measurement point is closer to the second than to the first second-syllable Oulu measurement point. There was no significant F0 difference between Oulu and Turku at the respective first second-syllable measurement points, but when the temporally close measurement points were considered (those inside the leftmost rectangle in Figure 1), there was a significant difference: F0 was higher in Turku than in the other dialects, in which the F0 level was statistically the same [F(2,51) = 4.10, p < 0.05]. A similar comparison was made between the Oulu and Tampere second-syllable third measurement points and the Turku second-syllable second measurement point (the measurement points within the middle rectangle), and it turned out that F0 was higher in Turku than in the other dialects, in which the F0 values were statistically the same [F(2,51) = 6.56, p < 0.01]. A comparison between the Oulu and Turku second-syllable fourth measurement points and the Tampere second-syllable fifth measurement point (the rightmost rectangle) revealed that F0 was still higher in Turku than in Oulu, but Tampere did not differ statistically from the other dialects [F(2,51) = 3.88, p < 0.05]. Some similar F0 comparisons between nominally different, but temporally close measurement points were made also in the other word structures than CV.CV(X), in situations where there were no significant F0 differences between the dialects at the nominally same measurement points but where nominally different measurement points were temporally closer to each other than the nominally same ones. However, none of those comparisons revealed any significant dialect differences.
In the CVV.CV(X) words (see Figure 2), F0 was higher in Oulu than in the other dialects (which did not differ from each other) at the first [F(2,45) = 9.62, p < 0.001], the second [F(2,45) = 10.37, p < 0.001] and the third [F(2,45) = 6.88, p < 0.01] measurement points of the first syllable. At the later measurement points there were no significant F0 differences between the dialects, except at the third-syllable third measurement point, where F0 was lower in Oulu than in Turku, and the Tampere F0 did not differ from those of the other dialects [F(2,30) = 4.09, p < 0.05]. The location of the F0 peak was statistically similar in all the dialects.
Figure 2. F0 in the CVV.CV(X) words (the duration values represent the durations in the four-syllabic words).
In the CVCaVoiceless.CbV(X) words (see Figure 3), F0 was higher in Oulu than in the other dialects (which did not differ statistically from each other) at the first-syllable first measurement point [F(2,45) = 9.48, p < 0.001] and at the first-syllable second measurement point [F(2,45) = 9.30, p < 0.001]. At the third [F(2,45) = 8.56, p = 0.001], fourth [F(2,45) = 7.03, p < 0.01] and fifth [F(2,45) = 5.60, p < 0.01] measurement points of the first syllable F0 was higher in Oulu than in Tampere, but Turku did not differ from the other dialects. Later in the word there were no significant F0 differences between the dialects. The dialects did not differ from each other in the location of the F0 peak.
Figure 3. F0 in the CVCaVoiceless.CbV(X) words (the duration values represent the durations in the four-syllabic words).
In the CVCaVoiced.C2V(X) words (Figure 4), F0 was higher in Oulu than in the other dialects (which did not differ from each other statistically) at the first four measurement points of the first syllable (point 1 [F(2,45) = 15.23, p < 0.001], point 2 [F(2,45) = 15.34, p < 0.001], point 3 [F(2,45) = 15.47, p < 0.001], point 4 [F(2,45) = 7.57, p < 0.001]). At the later measurement points of the CVCaVoiced.C2V(X) words there were no significant F0 differences between the dialects. Also the location of the F0 peak was statistically the same in all the dialects.
Figure 4. F0 in the CVCaVoiced.C2V(X) words (the duration values represent the durations in the four-syllabic words).
In the CVCa.CaV(X) words (Figure 5) F0 was higher in Oulu than in the other dialects (which did not differ from each other statistically) at the first-syllable first [F(2,45) = 12.81, p < 0.001] and first-syllable second [F(2,45) = 12.70, p < 0.001] measurement points. At the third [F(2,45) = 9.41, p < 0.001], the fourth [F(2,45) = 7.96, p = 0.001] and the fifth [F(2,45) = 6.92, p < 0.01] measurement points of the first syllable F0 was higher in Oulu than in Tampere, but Turku did not differ from the other dialects. Later in the CVCa.CaV(X) words there were no significant F0 differences between the dialects, and the location of the F0 peak was statistically the same in all the dialects.

Figure 5. F0 in the CVCa.CaV(X) words (the duration values represent the durations in the four-syllabic words).
An additional experiment
It turned out above that the Turku CV.CV(X) words differ from all the other words investigated: in these words, F0 is at its highest at the second-syllable first measurement point, not during the first syllable. Because nearly all of the C2s of the CV.CV(X) words in the original material are voiceless, it remained unclear where the F0 peak would be if C2 were voiced. To resolve this question, some extra material was recorded: 5 informants, who were among the Turku informants in the first recordings, read 30 CV.CV words both of whose consonants were voiced. The words were placed in frame sentences in positions where they would carry contrastive accent; for example, the target word lumi 'snow' was placed in the sentence Sanoin että kaikki LUMI katosi, en sanonut että kaikki LOSKA katosi 'I said that all the SNOW disappeared, I didn't say that all the SLUSH disappeared'. The recordings were performed technically in the same way as the first recordings. F0 was measured at five equidistant points of the first syllable in the same manner as in the first study; in the second syllable F0 was measured at seven points: in addition to the five measurement points used in the first study, F0 was also measured at the temporal midpoint between the first and the second measurement points and at the temporal midpoint between the second and the third measurement points. F0 in the syllables preceding and following the target word was measured in the same way as in the first study. The temporal location and the Hz value of the F0 peak within the word were also measured.
Figure 6. F0 and segment durations in the new Turku CVCV words (with voiced consonants). The mark representing the F0 peak is indicated by an arrow.
Figure 6 shows the main result of the additional experiment: in the Turku dialect CV.C2V words, F0 reaches its peak value a little before the middle of C2 if C2 is voiced; in the extra-material target words the F0 peak is located approximately 215 ms from word onset.
Discussion
In all the word structures investigated, F0 was higher in Oulu than in the other dialects in the first syllable, or at least at the beginning of the first syllable. Otherwise, the F0 curves were quite similar across the dialects, with one exception: in the CV.CV(X) words of the Turku dialect, the F0 peak occurred a little before the middle of C2 (as observable in words in which this consonant is voiced). In all other word structures investigated the F0 peak occurred at the end of the first syllable. In the second syllable of the CV.CV(X) words, F0 is also higher in Turku than in the other dialects. (C)V.CV(X) is the only word structure in Finnish in which the word's second mora is in the second syllable of the word. Obviously the Turku dialect manages this situation in a way that is different from that used in the two other dialects investigated. The segment durations in the Oulu, Turku and Tampere dialects in the word structures discussed in this paper have also been investigated; the results of those investigations will be reported in the future.
References
Suomi K., Toivanen J. & Ylitalo R. (2006). Fonetiikan ja suomen äänneopin perusteet. Helsinki: Gaudeamus.
Suomi K., Toivanen J. & Ylitalo R. (in preparation). Finnish sound structure.

Improving speaker skill in a resynthesis experiment
Eva Strangert¹ and Joakim Gustafson²
¹ Department of Language Studies, University of Umeå
² CSC, Department of Speech, Music and Hearing, KTH

Abstract
A synthesis experiment was conducted based on
data from ratings of speaker skill and acoustic
measurements in samples of political speech.
Features assumed to be important for being a
good speaker were manipulated in the sample
of the lowest rated speaker. Increased F0 dy-
namics gave the greatest positive effects, but
elimination of disfluencies and hesitation paus-
es, and increased speech rate also played a
role in the impression of improved speaker
skill.
Introduction
The current study concerns public speaking
with a focus on qualities that characterize
speakers held to be good speakers. By that
we mean persons capable of adjusting their
speech to the maximum of what is possible in
order to get across to an audience. Although
public speakers vary in the extent to which they
meet this criterion of speaker skill, such a ca-
pability is a great asset, not least in politics.
It is assumed that this capability to a great ex-
tent depends on prosody, in particular how
prosody is used to signal intentions of the
speaker and attitudes towards the listener.
In a resynthesis experiment we modify the
original sample of one of the speakers analyzed
in a previous study (Strangert, 2007) concerned
primarily with subjective ratings of speaker
qualities. The speakers were chosen so as to
represent a great variation in order to be able to
single out features with a potential for distin-
guishing good speakers from less good ones.
The study also included some restricted acous-
tic-prosodic measurements of the speech sam-
ples. In closely related studies based on English
and English and Arabic, respectively, Rosen-
berg and Hirschberg (2005) and Biadsy et al.
(2008) had subjects rate charisma and corre-
lated these ratings with a great number of
acoustic features.
In the present study we build on the analy-
ses in Strangert (2007) and, in addition, extend
the acoustic-prosodic analysis presented there
to include more features potentially relevant for
the impression of speaker skill.
Rating data
The speech sample used in the resynthesis ex-
periment was one of those (16 in total) ana-
lyzed in Strangert (2007). They were re-
cordings (audio and video) from debates in the
Swedish parliament (Riksdagen) representing a
variety of speakers (more and less skilled, male
and female) and styles (read and spontaneous).
The samples were rated by 18 native Swed-
ish subjects who gave their opinion on a num-
ber of qualities on a five-point scale from 'no, absolutely not' (0) to 'yes, absolutely' (4). The ratings included an overall good-speaker rating, 'good speaker' being defined as a person capable of catching the attention and interest of an audience through her/his way of communicating. This rating was matched to all the other ratings and found to have strong positive correlations with expressive, powerful, involved and trustworthy (all with r ≥ .89; correlations based on means of all individual ratings for each quality). There were also positive correlations with aggressive, accusatory and agitating (r ≥ .65), and negative correlations with humble (r = -.55) and with insecure, hesitant and monotonous (r ≤ -.86).
The rating scores varied considerably be-
tween the speakers. The mean good-speaker
score varied from 3.39 for the speaker rated
highest to 0.56 for the lowest rated speaker.
Acoustic analysis
The acoustic measures included F0 range (in
semitones, and as a ratio of mean F0 maximum
of focused words to mean F0) and number of
focus positions. Further, minimum, maximum
and mean F0 and mean of F0 maximum of fo-
cused words were measured separately for the
male and female speakers. (Focused words
were identified by the first author through lis-
tening.) Measurements also included a number
of duration features and speech rate measures,
which we leave out here, concentrating on those features correlating strongly with good speaker.
Table 1 summarizes F0 measures and their
correlations with the (mean) good-speaker
scores. As shown, the number of focused words
varies considerably. There is a positive correlation of .47 with the good-speaker scores, but it does not reach significance (p=.07).

Table 1. F0 measures and their correlations with the mean good-speaker scores.

Feature                                   max     min     r      p
Mean F0 max of foc. words / mean F0       1.85    1.17    .61    .01*
F0 range, 75-25 percentiles (ST)          8.78    2.44    .61    .01*
Mean F0 max of focused words, female      384     219     .87    .005**
Mean F0 max of focused words, male        233     150     .76    .03*
F0 max, female                            516     265     .82    .01*
F0 max, male                              325     190     .63    .09
F0 mean, female                           246     180     .62    .10
F0 mean, male                             164     113     .65    .08
Focused words (N)                         23      3       .47    .07

Significant and positive correlations include F0
range, measured in semitones and as a ratio of
mean F0 maximum of focused words to mean
F0. The two measures give similar results
(r=.61; p<.05) and we therefore confine our-
selves to the semitone data in the following.
Ranges in semitones between the 25% and 75%
points in the F0 distribution vary between 8.78
and 2.44 for individual speakers with a median
of 4.7. These figures may be compared to simi-
larly computed ranges (25% -75% distribution
points) extracted from 498 speakers in the Swe-
dish SpeeCon database (Carlson et al., 2004).
The great majority of these ordinary speakers
had a range of 2-5 semitones. Half of our speakers' ranges thus fell outside this more restricted interval.
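This range measure can be reproduced directly from an F0 track: convert the voiced F0 values to semitones and take the difference between the 75th and 25th percentiles. A minimal sketch, assuming the F0 values are available as a plain array (the 100 Hz reference is an arbitrary choice; the difference between two semitone values does not depend on it):

import numpy as np

def f0_range_semitones(f0_hz, ref_hz=100.0):
    # Interquartile F0 range in semitones: 75th minus 25th percentile
    # of the semitone-transformed voiced F0 values.
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz[f0_hz > 0]                  # drop unvoiced/zero frames
    semitones = 12.0 * np.log2(voiced / ref_hz)
    return np.percentile(semitones, 75) - np.percentile(semitones, 25)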
There is a significant correlation of good
speaker with F0 maximum, but only for the fe-
male speakers (r=.82; p<.05). For the mean of
F0 maximum of focused words, on the other
hand, there is a positive correlation for both the
male (r=.76; p<.05) and female (r=.87; p<.01)
speakers. This feature is comparable to the
pitch range measure (mean HiF0, the highest
accented pitch peak) in Biadsy et al. (2008)
which correlated positively with charisma in
American English and Palestinian Arabic when
rated by American and Palestinian as well as
Swedish subjects. This feature in addition was
found to be more important for the Swedish
subjects than for the others.
Thus F0 dynamics appears to be primarily
associated with the extent to which the range is
widened upwards; the correlation with F0 min-
imum is insignificant. Neither is mean F0 over
the speech sample (excluding silent pauses)
significant, although both female and male speakers' correlations exceed .6 (with p=.08
and .10, respectively). Biadsy et al. (2008), on
the other hand, found significant correlations
between mean F0 and charisma ratings, irre-
spective of the language of the raters, for both
American and Palestinian Arabic speech mate-
rials.
Fluency and speaking style
As some speakers were perceived as far more
fluent than others, a measure of disfluency was
calculated. The number of positions with a slip
of the tongue, a repetition or a repair was de-
termined through listening by the first author.
This measure showed a strong negative correla-
tion (r=-.72; p<.01) with good speaker. A simi-
lar negative correlation was found also in the
cross-cultural comparison by Biadsy et al. (2008), with the exception of the Swedish
judgments of American English. We note this
difference between Swedish judgments of Swe-
dish and English, respectively, which may be
ascribed to cultural influences.
Disfluencies occur primarily in speech pro-
duced spontaneously as a result of problems
with the planning of what to say next. As some
of our speakers read from a manuscript and
some spoke more freely, we could relate the
disfluency scores to the read vs. spontaneous
style of speaking. Even though the three most
disfluent speakers were in fact speaking spon-
taneously, there was no obvious relation taking
all speakers into account. Neither was there any
obvious relation between speaking style and the
good-speaker rating.
Resynthesis experiment
In the acoustic analysis, F0 features, in particu-
lar a wide F0 range and high-peaked focused
words, were found to give high ratings of
good speaker, while the opposite, a smaller
range and focused words with lower peaks, was
given low ratings. Also, the good speakers were
to a great extent fluent, while the less good
ones had lots of repetitions, repairs etc. These
results were elaborated in an experiment in
which the sample of the speaker with the lowest
score (0.56) for good speaker was modified
in several ways. The assumption was that, rely-
ing on our production results, we could im-
prove the perceived skill of speaking.
The scores for the selected (male) speaker
were high for insecure, hesitant, monotonous
and low for expressive, powerful, aggressive
and trustworthy, and he had the highest score of
all speakers for humble. He was also the second
most disfluent speaker with a total of 12 disflu-
ency positions, and he had the smallest F0
range (2.84 ST), F0 maximum (190 Hz), and
mean F0 maximum of focused words (150 Hz). This speaker was also the slowest, with a speech rate (including pauses) of 3.46 syllables per second. Thus, this speaker was a natural candidate for the resynthesis experiment.
Hypotheses
The features to be evaluated included the two
that had the highest correlations (positive and
negative, respectively) with good speaker: F0
dynamics, and disfluencies. (In the following,
when referring to experimental manipulations,
we use the more neutral term fluency instead
of disfluencies.) As the selected speaker was
extremely slow, we also included speech rate,
although the speech rate features overall corre-
lated insignificantly with good speaker.
We hypothesized that of these features, rate
would be the least effective for improvement of
speaker skill. Concerning the other two, we
could not decide in advance between alternatives a) and b). There might also be interactions
between the features, giving a third alternative:
a) F0 dynamics > fluency > speech rate
b) fluency > F0 dynamics > speech rate
c) F0 dynamics, fluency and speech rate interact
Experimental setup
To create the experimental stimuli, we used the
KTH resynthesis toolkit EXPROS (Gustafson and
Edlund, in press) together with the Mbrola di-
phone synthesis toolkit (Dutoit et al., 1996).
This was a three-step process: first, the EXPROS toolkit was used to automatically generate the
data needed to build a new Mbrola diphone da-
tabase from the original speech sample (36 sec-
onds in length). Then the Mbrola toolkit was
used to build a customized Mbrola mini-voice.
Finally, EXPROS was used to modify the pro-
sodic features of the original speech sample.
The following manipulations were performed:
F0 dynamics: The pitch scale was transformed to a semitone
scale. The mean pitch was increased by two semitones and the
range was expanded, so that the standard deviation increased
from 2 semitones to 4.
Fluency: Reduction of disfluencies was made by cutting out
slips of the tongue and repetitions.
Speech rate: Speech rate was increased by 5% and long silent
hesitation pauses were considerably shortened.

Thus, there were eight stimuli (2x2x2) includ-
ing all combinations of original and modified
F0 dynamics (O/M F0), fluency (O/M fluency)
and speech rate (O/M rate).
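As an illustration of the F0 dynamics manipulation listed above, the sketch below rescales an F0 contour on a semitone scale so that its mean rises by two semitones and its spread is doubled (from a standard deviation of about 2 to about 4 semitones). It operates on a plain array of F0 values and is only a sketch of the effect, not the actual EXPROS implementation.

import numpy as np

def expand_f0_dynamics(f0_hz, ref_hz=100.0, mean_shift_st=2.0, spread_factor=2.0):
    # Convert to semitones, shift the mean and scale the deviation from
    # the mean, then convert back to Hz. Unvoiced frames (coded as 0 Hz)
    # are left untouched.
    f0_hz = np.asarray(f0_hz, dtype=float)
    out = f0_hz.copy()
    voiced = f0_hz > 0
    st = 12.0 * np.log2(f0_hz[voiced] / ref_hz)
    st_new = st.mean() + mean_shift_st + spread_factor * (st - st.mean())
    out[voiced] = ref_hz * 2.0 ** (st_new / 12.0)
    return out

With the speaker's original spread of about 2 semitones, spread_factor=2.0 would reproduce the 2-to-4-semitone expansion used in the experiment.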
The 12 subjects in the test, all academic
staff or students in areas other than phonetics,
made a ranking of the eight versions. They did
so using an interactive computer program im-
plementing a visual sort and rate/rank method.
Each of the eight versions was represented
by an icon in random order on the computer
screen. The subjects were instructed to rank
them from best (1) to worst (8) in reference to
the criterion for a good speaker, that is, a per-
son capable of catching the attention of an au-
dience through her/his way of speaking. Be-
fore coming up with the ordering they pre-
ferred, the subjects could listen to the stimuli as
many times as they wished and try different
rankings by moving the icons around.
Results
In a complex task such as this, we cannot expect total uniformity between subjects' rankings. Despite this, there is a fair degree of consistency; the agreement between subjects (Kendall's W) is .48 (p<.001).
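For reference, Kendall's coefficient of concordance W for a complete set of rankings without ties (m raters, n items) can be computed as below; the sketch is generic, not the evaluation software used here.

import numpy as np

def kendalls_w(ranks):
    # ranks: (m_raters, n_items) matrix where each row is one subject's
    # ranking of the items from 1 (best) to n (worst), without ties.
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

With 12 subjects ranking 8 stimuli, W = 1 would mean identical orderings and W near 0 no agreement at all; the obtained .48 lies in between.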
Table 2. Median and distribution of 12 subjects' rankings (1 = best, 8 = worst) of eight versions of a speech sample with modifications (M) of F0, fluency and speech rate in the original (O) version.

Stimuli                     Distribution of rankings    Md
M:F0  M:flu  M:rate         8 1 1 2                     1
M:F0  M:flu  O:rate         1 4 5 1 1                   3
M:F0  O:flu  M:rate         4 3 3 1 1                   3
M:F0  O:flu  O:rate         1 1 4 2 1 3                 4.5
O:F0  M:flu  M:rate         2 2 2 4 2                   4.5
O:F0  M:flu  O:rate         1 2 1 4 2 2                 6
O:F0  O:flu  M:rate         1 1 1 5 3 1                 6
O:F0  O:flu  O:rate         2 1 3 6                     7.5

The results support the general assumption that
perceived speaker skill can be improved by
modifications such as those suggested by our
production data. The general trend is that the
more modifications, the higher the ranking.
This is demonstrated in Table 2, which shows
the distribution of rankings for the eight stimu-
lus versions together with the median rankings.
The results indicate that F0 modifications play
the major role, with modifications of fluency
and speech rate being second and third. The re-
sults pooled across all subjects then come close
to an ordering according to hypothesis a) in
terms of perceptual weight. Even though all
rankings do not correspond exactly with this
ordering, they are nevertheless concentrated
around it as indicated by the shadowed area.
A more detailed analysis reveals interesting differences between subjects. Some had rankings reflecting a completely systematic ordering of the features in accordance with hypothesis a), while others were less systematic. This is not unexpected, as making judgments about a phenomenon such as speaker skill is by no means a simple task. The features under investigation may be expected to interact in complex ways, but individual experiences and preferences may also play a role. Several of the subjects spontaneously commented on their impressions of the stimuli after the test. Some of them, for example, found slips and other disfluencies very disturbing, while others looked upon the same phenomena as something natural and more or less unavoidable. Still, most of them, in line with the general result, favored a modified F0 range, and some reported that they could easily divide the eight versions into two groups (original and modified F0 dynamics), but that setting priorities within these groups was difficult.
Conclusions and future work
In this study, the focus was on features contrib-
uting to the impression of a good speaker, a
person capable of catching the attention and
interest of an audience through her/his way of
communicating. We conducted a resynthesis
experiment based on the results from good-
speaker ratings combined with acoustic meas-
urements from a number of speakers. Acoustic
features which correlated significantly with the
subjective ratings were perceptually evaluated.
Our data had revealed a strong positive correla-
tion between good speaker and F0 peak height
of focused words and F0 range. A strong but
negative correlation was found for disfluency.
In the resynthesis evaluation these features,
combined with speech rate, were manipulated
through modifications in the speech sample of
the speaker rated lowest. By increasing F0 dy-
namics, eliminating disfluencies and hesitation
pauses, and speeding up the speech, the impres-
sion of speaker skill improved considerably.
Modifying F0 dynamics produced the greatest
effects with changes of (dis)fluency and speech
rate, respectively, second and third. The results
support related findings pointing to the impor-
tance of F0 variability in ratings of charisma
(Biadsy et al., 2008) and liveliness (Traunmüller and Eriksson, 1995; Hincks, 2005).
Combined with more acoustic data, resyn-
thesis evaluations like the one conducted here
could shed further light on what makes a speak-
er a good speaker. Also, the results from the
synthesis experiment open up possibilities for useful applications, for example speaker training.
Acknowledgements
We thank John Lindberg and Roberto Bresin,
KTH, for making the evaluation software avail-
able for the perceptual ranking. We also thank
all subjects for their participation in the ex-
periments. The work has been supported by
funding from the Swedish Research Council.
References
Biadsy, F., Rosenberg, A., Carlson, R., Hirsch-
berg, J. and Strangert, E. (2008) A Cross-
Cultural Comparison of American, Palestin-
ian, and Swedish Perception of Charismatic
Speech. To appear in Proc. Speech Prosody,
Campinas, Brazil.
Carlson, R., Elenius, K. and Swerts, M. (2004)
Perceptual Judgments of Pitch Range. Proc.
Speech Prosody, Nara, Japan, 689-692.
Dutoit, T., Bataille, F., Pagel, V., Pierret, N.,
Van der Vreken, O. (1996) The MBROLA
Project: Towards a Set of High-Quality
Speech Synthesizers Free of Use for Non-
Commercial Purposes. Proc. Interspeech.
Philadelphia, USA.
Gustafson, J. and Edlund, J. (In press) expros: a
toolkit for exploratory experimentation with
prosody in customized diphone voices. To
appear in Proc. 4th IEEE Workshop on Per-
ception and Interactive Technologies for
Speech-Based Systems. Kloster Irsee, Ger-
many.
Hincks, R. (2005) Computer Support for Learn-
ers of Spoken English, Diss. Speech and
Music Communication, KTH, Sweden.
Rosenberg, A. and Hirschberg, J. (2005) Acou-
stic/prosodic and lexical correlates of char-
ismatic speech. Proc. Interspeech, Lisboa, Portugal, 513-516.
Strangert, E. (2007) What makes a good spea-
ker? Subjective ratings and acoustic meas-
urements. Proc. Fonetik 2007, TMH-QPSR,
50, 29-32.
Traunmüller, H. and Eriksson, A. (1995) The perceptual evaluation of F0 excursions in speech as evidenced in liveliness estimations. JASA, 97(3), 1905-1915.

Second-language speaker interpretations of intonation-
al semantics in English
Juhani Toivanen
Academy of Finland & University of Oulu

Abstract
Research is reported on the way in which Fin-
nish speakers of English interpret the seman-
tic/pragmatic meaning of the fall-rise intona-
tion in spoken English. A set of constructed
mini-dialogues were used for listening tests in
which the test subjects were to interpret the
meaning of the fall-rise tone. To obtain base-
line data, a group of native speakers of English
listened to the same material, with the same in-
terpretative task. The results indicate that the
native speakers consistently interpreted the fall-
rise pattern as conveying reservation (or iro-
ny), whereas the non-native speakers perceived a 'reserved' meaning only if the lexical context explicitly supported such an interpretation.
Introduction
The semantic/pragmatic meaning of the fall-rise intonation contour has attracted a great deal of attention in the literature on English prosody (the fall-rise is transcribed with the diacritic ˇ below). Basically, the tone is associated with reservations, implications and doubts. It can also be argued that the fall-rise conveys uncertainty or 'incompletion' (as all rising tones do), but the fall-rise is apparently associated with especially delimiting open meanings; it has sometimes, and quite rightly, been referred to as the 'contingency tone' of English intonation. That is, the fall-rise is often an indication that the proposition or argument is correct only under certain circumstances. Roach (1991) uses the terms 'limited agreement' and 'response with reservations' to describe the pragmatic meaning of the fall-rise. In the literature on the subject, the following meanings, for example, have been attributed to the tone: 'implicatoriness', 'reservation and contradiction', 'lack of complete commitment', and 'strong implication'. The common denominator is, clearly, an indication of some concealed doubt or contrast: the speaker may say one thing and mean something else. That is, a subtle prosody-dependent pragmatic meaning is created.
From the viewpoint of second language ac-
quisition, intonation can be seen as belonging
to the pragmatic aspects of language. Pragmat-
ics is probably one of the most difficult areas of
second language acquisition in general. It seems
likely that misunderstandings resulting from
different ways of interpreting intonational
meaning will interfere with a common dis-
course space between the native speaker and the
non-native interlocutor even if the non-native
speech may be otherwise (e.g. grammatically)
quite acceptable.
In this light, the study of the cross-linguistic
interpretation of the semantic meaning of Eng-
lish intonation contours is a most profitable un-
dertaking. For the purpose of this study, the
meaning of the fall-rise pattern was chosen for
scrutiny. On the one hand, this contour has a
specific meaning in (British) English intona-
tion; on the other hand, the fall-rise does not
have a counterpart in Finnish intonation. As Ii-
vonen (1998) points out, a final rise is rare in
Finnish and the rules found in French, English,
and German associated with the use of a final rise
do not exist in Finnish.
Experiment
To obtain suitable test material, a native speak-
er of English, a professional phonetician, was
asked to produce a declarative utterance with a
falling-rising nuclear tone on the last word. The
test utterance, by itself and combined with other
utterances, constituted the material used in the
listening test. All the speech material used in
the experimental setup was tape-recorded with
a high-quality microphone and a DAT recorder,
and transferred onto hard disk (44.1 kHz, 16
bit). The test utterance was the following (with
the nuclear tone on the latter syllable of de-
gree):
Shes got a good
V
degree
The speaker also produced a number of other
utterances to create coherent lines in the mini-
dialogues: the speaker is referred to as Bill.
Another native male speaker of English
(John) was the interlocutor and produced the
other lines in the dialogues (see the Notes sec-
tion). Four mini-dialogues contained the test
utterance; in three mini-dialogues the test utter-
ance was accompanied by one or two additional
utterances to produce Bill's line. Four addi-
tional mini-dialogues served as distractors: they
did not contain the test utterance. Only the test
utterance contained a falling-rising intonation:
all the other utterances in the test dialogues
ended on simple falls. The distractor dialogues
contained both falls and rises but not fall-rises.
The listeners were all university students:
the Britons majored in non-linguistic subjects,
while the Finns were first-year university stu-
dents of English. Ten Britons and ten Finns, all
female speakers in their early twenties, partici-
pated in the listening test. The test was adminis-
tered in a language laboratory; the listeners had
written transcripts of the dialogues in front of
them, and the line to which they were to pay
attention was underlined (see the Notes sec-
tion). The test subjects listened to each dialogue
once and chose one of six descriptive labels for
the line whose attitudinal/emotional content
they were to judge. The labels describing the
lines were the following: friendly, reserved,
bored, joyful, casual, and ironical.
One or two things should be pointed out at
this stage. Firstly, dialogues 1 and 3 are basical-
ly comparable to the examples given in the in-
tonation literature: the most typical semantic
meaning of an intonation contour is often dis-
cussed out of context or in a lexical context
which clearly supports the supposed meaning of
the contour. In dialogue 3, the reservation
conveyed by the fall-rise is very much in
agreement with the lexical content of the line.
The examples given by Cruttenden (1997), for
example, are rather similar to the test utterance
in dialogues 1 and 3:
You won't ˇlike it
Be careful you don't ˇfall
I like ˇJohn (but)
In dialogues 5 and 7, by contrast, the implica-
tion or doubt expressed by the fall-rise con-
flicts with the positive ideas expressed ver-
bally. The interesting question is, of course,
whether the pragmatic force of the fall-rise is
strong enough to counteract the lexical meaning of the lines, a point rarely discussed in the literature.
All the other utterances in the test dialogues
ended on a tone which could be described as a
high-fall (i.e. a relatively wide unidirectional
f0 movement). This tone is often assumed to
represent the most typical intonation contour
with declaratives (Cruttenden 1997). The high-
fall is common even with polar questions, at
least in informational conversation. Thus it can
be claimed that the test dialogues were intona-
tionally neutral apart from the utterance with
the fall-rise, i.e. the fall-rise is clearly a devia-
tion from the general falling trend and should
thus attract some special attention. However,
since the distractor dialogues contained rising
tones, the test sentence did not stand out as the
only utterance ending on a rising contour.
Results
In dialogue 1, nine native speakers of English chose the attitudinal label 'reserved', and one chose the term 'ironical'. It seems clear, then, that, to the native ear, even a largely decontextualized utterance with the fall-rise sounds predominantly negative. The responses of the Finnish informants, by contrast, were much more heterogeneous. The label 'friendly' had the most votes (4); the other interpretations were 'casual' (3), 'reserved' (2) and 'joyful' (1). The Finns apparently paid attention mainly to the lexical content of the utterance. On the other hand, it might be the case that the Finnish informants associated the falling-rising intonation with friendliness. After all, (low) rising intonation often accompanies conventionally polite declarative utterances in spoken English. The small amount of data, of course, prevents one from drawing any far-reaching conclusions.
An interesting question is whether the test utterance, produced with a simple falling intonation, might still convey implications or reservations in dialogue 1. That is, could the utterance (She's got a good degree), as a response to the question (What do you think of her?), convey a conversational implicature of some kind? It might be possible that the speaker deliberately flouts the maxims of quantity and relevance in saying far too little. The situation might resemble the famous (and extremely concise) critique of a book:
The book is well-bound and free of typographic errors
The review flouts the maxim of quantity, and the implicature is, clearly, that the book is terrible. In dialogue 1, even without the fall-rise, the implicature might be something like 'the lady is well-educated but is a difficult person'. However, it must be emphasized that this is basically only speculation.
Brown and Yule (1983) describe the di-
lemma of the hearer and discourse analyst as
follows: since the analyst has only limited
access to what a speaker intended, or how sin-
cerely he was behaving, in the production of a
discourse segment, any claims regarding the
implicatures identified will have the status of
interpretations. This latitude of interpretation
would probably obtain in dialogue 1 if the test
utterance were spoken with a falling tone. Natu-
rally, the semantic interpretation of the utter-
ance produced with vs. without a fall-rise
should be investigated in a separate study.
Dialogue 3 is very different from dialogue
1: here the reservations and doubts are ex-
pressed both verbally and prosodically. Here, as
in dialogue 1, nine native speakers heard 'reservations', while one interpreted the speaker as 'ironical'. Eight Finns chose the label 'reserved', one chose 'bored' and one 'ironical'. The situation seems rather straightforward: as the lexical content is in harmony with the intonation contour, the 'reserved' meaning was readily perceivable. However, it is likely that
the Finns again regarded the lexical content as
the major factor contributing to the attitudin-
al/pragmatic meaning of the line. That is, even
without a falling-rising intonation, the attitude
might have been obvious (the same probably
goes for the British test subjects). In any case,
in this dialogue, the lexical meaning prejudices
the listener much more than in dialogue 1.
Dialogues 5 and 7 can be discussed togeth-
er. In both of them, the lexical meaning of the
test line is apparently very positive while the
tone is, again, the fall-rise conveying possible
doubts or implications: the written version of
the line could easily be interpreted as friendly
or even joyful. Indeed, this predisposition was
clear in the responses of the Finnish test sub-
jects: in dialogue 5, nine speakers interpreted
the speaker as friendly (one chose the label
joyful), and in dialogue 7, eight informants
chose friendly, one joyful and one casual.
The British informants reactions differ mar-
kedly from those of the Finnish listeners. In di-
alogue 5, an ironical attitude was detected by
six informants, a reserved attitude by three,
and a casual attitude by one. I dialogue 7,
most of the informants (seven listeners) heard
an ironical attitude, while the rest interpreted
the speaker as reserved.
Apparently, in dialogues 5 and 7, the Finns
did not perceive any potential conflict with the
lexical meaning and the tone choice: the fall-
rise did not detract from the general positive
attitude expressed verbally. By contrast, the na-
tive speakers were obviously aware of the clash
between lexical meaning and the attitude con-
veyed by intonation. Interestingly, many of the
informants thought that the line was meant to
be ironical: the fall-rise was probably perceived
as being out of place in an otherwise positive
part of the dialogue, and the mismatch was attri-
buted to an ironical attitude. The situation here
may be partly similar to the example given by
Watt (1994). If the following utterance is ac-
companied by a smiling voice quality and a
gentle low rising intonation, a mordant effect of
sarcasm or irony is probably created:
Put that goddam pipe away
Incongruous linguistic content and intonation
conspire to produce a stylistic effect which is
likely to irritate or even unsettle the listener. As
Watt points out, intonation has an interperson-
al metafunction by serving as a channel for
linguistic expression of attitude. Lexicon and
(stylistic) register are other channels, and the
interaction of these modes creates attitudinal
meanings of different kinds.
Discussion
The investigation has revealed some interesting
differences in the semantic interpretation of the
fall-rise contour between native and non-native
speakers of English. The results support the
common view that the general pragmat-
ic/semantic meaning of the fall-rise can be de-
scribed in terms of such attitudinal labels as
'reserved' and 'doubtful': the native speakers
systematically associated reservations with the
tone when the lexical content was either neu-
tral or congruous with such an interpretation.
If there was a mismatch between words and the
tone, the clash was mainly interpreted as irony.
The British informants apparently had a very
clear idea about where the fall-rise fits in prop-
erly and where it is used for a deliberate pho-
nostylistic effect. The British informants could
thus analyze the meaning of the fall-rise also at
a metalinguistic level.
The Finnish informants mainly resorted to
the so-called lexico-syntactic strategy (see e.g.
Cruz-Ferreira 1986): speakers of a second lan-
guage analyze the (semantic/pragmatic) mean-
ing of an utterance as corresponding to the most
immediate interpretation of the lexical and
grammatical content of the sentence.
The conclusions drawn on the basis of this
investigation are supported by a study of the
productive English intonation skills of Finns:
Toivanen (2001) offers empirical evidence that
Finnish speakers of English very rarely use the
falling-rising tone in conversation. Thus, al-
though Finns can make, phonetically and pho-
nologically, a distinction between falling and
rising intonation, Finns are very hesitant about
associating rising tones with informational
and/or pragmatic openness. In the colloquial
English speech of Finns, reserved or incom-
plete statements are typically accompanied by
falling tones in contradistinction to the Eng-
lish spoken by native speakers.
Conclusion
This investigation, although based on limited
and somewhat artificial material, suggests that
even very advanced Finnish speakers of English
do not fully master the intonational lexicon of
English. Finns are largely unaware of the prag-
matic meaning of the fall-rise intonation con-
tour, and analyze the tone as conveying reser-
vations only when the lexical meaning allows
for such an interpretation. The British infor-
mants readily perceive the reserved meaning of
the fall-rise. However, if the context suggests
an entirely different semantic interpretation, the
native speakers are likely to conclude that the
contrast between the lexical meaning and the
fall-rise indicates irony of some kind.
Notes
1.
John: What do you think of her?
Bill: She's got a good degree.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
2.
John: What are you doing here?
Bill: Just waiting for Tim. He seems to be late
for the meeting.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
3.
John: Do you think she's qualified for the job?
Bill: I don't know. She's got a good degree. But
she hasn't got much experience.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
4.
John: Excuse me, how much is this magazine?
There's no price tag on it.
Bill: But there must be a price tag.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
5.
John: My daughter has just graduated from
university. She's a lawyer now.
Bill: I'm glad to hear that. She's got a good de-
gree.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
6.
John: Can I borrow your car tonight?
Bill: If you really need it I guess you can.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
7.
John: Did you know that my new boss has got a
doctorate in electrical engineering?
Bill: Yes. She's got a good degree. And she's
also a nice person.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
8.
John: Hey, Bill. May I use your cell phone? I
seem to have misplaced mine.
Bill: By all means.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
References
Brown G. and Yule G. (1983) Discourse Analy-
sis. Cambridge: Cambridge University
Press.
Cruttenden A. (1997) Intonation. Cambridge:
Cambridge University Press.
Cruz-Ferreira M. (1986) Non-native interpre-
tive strategies for intonational meaning: an
experimental study. In James A. and Leather
J. (eds) Sound Patterns in Second Language
Acquisition, 256-269. Dordrecht: Foris Pub-
lications.
Iivonen A. (1998) Intonation in Finnish. In
Hirst D. and DiCristo A. (eds) Intonation
systems. A survey of twenty languages, 311-
327. Cambridge: Cambridge University
Press.
Roach P. (1991) English phonetics and phonol-
ogy. A practical course. Cambridge: Cam-
bridge University Press.
Toivanen J. (2001) Perspectives on Intonation:
English, Finnish and English Spoken by
Finns. Frankfurt am Main: Peter Lang.
Watt D. (1994) The Phonology and Semology
of Intonation in English. Bloomington, Indi-
ana: Indiana University Linguistics Club
Publications.

Measures of Continuous Voicing related to Voice Quality in Five-Year-Old Children
Mechtild Tronnier¹ and Anita McAllister²
¹ Department of Culture and Communication, University of Linköping
² Department of Clinical and Experimental Medicine, University of Linköping

Abstract
The present investigation pursues the question whether the perceptual judgement of a child's voice as sounding hoarse can be correlated with the degree of non-periodic sections in the signal, based on the lack of regular oscillation of the vocal folds. The results show that this is not the case: children with clearly hoarse voices produce a stable and measurable fundamental frequency. In addition, the recording with the highest proportion of periodicity is rated as the roughest voice, and children who show a tendency toward unstable fundamental frequency are not perceptually evaluated as being particularly hoarse.
Introduction
A hoarse voice consists of an assembly of de-
viations of several perceptual voice quality di-
mensions. The prominent parameters are:
roughness, breathiness and hyperfunction. Ac-
cording to perceptual evaluations, hoarseness in
adult voices is dominated by the feature rough-
ness, followed by breathiness and hyperfunc-
tion. In ten-year-old children however, hyper-
functionality and breathiness are the main per-
ceptual features of a hoarse voice, with minor
contributions of roughness (Sederholm et al.
1993, Sederholm 1995).
Acoustically a rough voice is typically char-
acterised by irregular phonation in the temporal
dimension (jitter) and in the intensity dimen-
sion (shimmer) (Laver, 1980). When character-
ising the voice quality of a healthy speaker with
a subtle tendency toward a hoarse voice, irregularities are usually very small, but nevertheless present and perceivable. For a more severe degree of hoarseness, irregularities are more prominent. In the case of hoarseness due to a severe cold, voicing can be absent, i.e. no periodic signal is produced during a short sequence of speech. A voice judged as rough is therefore expected to correlate with a reduced scope of periodicity.
As a breathy voice is characterized as weak and ineffective, with incomplete closure leading to leakage but nevertheless with regular phonation, no correlation between the degree of breathiness and the lack of periodicity in the signal is expected to occur.
A hyperfunctional voice is not recognised in terms of high levels of jitter and shimmer, as analyses of perturbation show high regularity in both dimensions (Klingholz & Martin, 1985; McAllister et al. 1998). Spectral measures, like a low level of the fundamental frequency relative to the first formant, may be more appropriate.
The main question underlying the present
investigation is therefore to what extent children's voices judged by professional speech pathologists as predominantly hoarse show absence of periodicity in the produced speech signal. Hereby, the correlation between the proportion of periodicity and the three dominant parameters is taken into consideration.
Making use of the degree of measurable
fundamental frequency as a method of voice
stability assessment in children is a further as-
pect of discussion in this investigation. The in-
tention is to test an alternative method to the
analysis of perturbation since experience shows
that the available systems often are standardised
for adult voices and not suitable for tracking the
high fundamental frequency in children's
voices.
Material and method
The material investigated in the present study is
part of the data gathered for the project Barn
och buller (Children and noise), which is a co-
operation between the University of Linköping
and KTH, Stockholm, within the BUG project
(Barnröstens utveckling och genusskillnader;
Child Voice Development and Gender Differ-
ences;http://www.speech.kth.se/music/projects/
BUG/abstract.html). It consists of recordings
from eleven five-year-old children, attending
three different pre-schools in Linköping. These
children were recorded using a binaural tech-
nique three times during one day at the pre-
school: on arrival and gathering in the morning, during lunch, and in the afternoon during
play time. At the beginning of each recording
session the children were asked to repeat the following phrases three times: "En blå bil. En gul bil. En röd bil." ('A blue car. A yellow car. A red car.') The repetitions of these phrases, which consist of voiced phonemes only, form the basis for the current investigation.
In an earlier study these phrase repetitions
were used to perceptually assess the degree of
hoarseness, breathiness, hyperfunction and
roughness by three professional speech pa-
thologists (McAllister et al., in press). Assess-
ment was carried out by marking the degree of
each of the four voice qualities plus an optional
parameter on a Visual Analog Scale (VAS).
The averaged VAS-ratings by the speech pa-
thologists for each individual and each percep-
tual parameter in the recordings were used in
the present investigation.
These recordings were used for an acoustic
analysis of sustained voicing. However, the ma-
terial was altered in that all pauses and silences
not belonging to the phonologically voiced parts of the utterance were cut out. An analysis of the fundamental frequency was then carried out in PRAAT with an analysis rate of 100 F0 values per second. The scope of voicing (in %) was obtained as the ratio of the number of F0 values actually obtained in PRAAT to the number expected given the length of the utterance.
A Pearson Product-moment correlation
was carried out between the scope of voicing
and the average rating of the different types of
voice quality.
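Expressed as a small sketch (an assumption about the implementation, not the authors' actual Praat procedure), the two steps above could look as follows; the parselmouth library, the file names and the example ratings are hypothetical placeholders, and the step of cutting out pauses and silences is omitted here:

import numpy as np
import parselmouth
from scipy.stats import pearsonr

def scope_of_voicing(wav_path):
    # Percentage of 10-ms analysis frames for which an F0 value is obtained.
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=0.01)      # 100 F0 values per second
    f0 = pitch.selected_array['frequency']    # 0.0 where no F0 was found
    return 100.0 * np.count_nonzero(f0) / len(f0)

# Hypothetical file names and hoarseness ratings (mm VAS) for three recordings.
voicing = [scope_of_voicing(p) for p in ["rec101.wav", "rec201.wav", "rec301.wav"]]
hoarseness = [14.5, 15.0, 13.5]
r, p = pearsonr(voicing, hoarseness)
print("r =", round(r, 2), " r2 =", round(r * r, 3), " p =", round(p, 3))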
Results
The results are shown in Table 1 and in the
Figures 1 to 4. Table 1 presents the scope of
voicing (in %) for each recording and the aver-
age rating for the comprehensive voice quality
hoarseness and the more specific voice quality
features breathiness, hyperfunction and rough-
ness (in mm VAS-scale). Furthermore the mean
and the standard deviation over the different
recordings for the voicing rate and perceptual
voice quality is presented in the table. Marked
cells refer to noteworthy values which are taken
up in the discussion.
The figures show the distribution of the
scope of voicing in relationship to different
types of voice qualities. The included trendline
reflects the correlation between the scope of
voicing and the degree of deviating voice quality. In addition, the squared correlation coefficient (R²), which underlies the trendline, is presented in each diagram.
Table 1. The scope of voicing and the average rating of the different voice qualities are shown for all recordings.

recording   taken F0 (in %)   hoarseness (mm VAS)   breathiness (mm VAS)   hyperfunction (mm VAS)   roughness (mm VAS)
101 88.09 14.5 20 13 3.5
201 86.53 15 7.5 9.5 1
301 90.65 13.5 12 11 4
102 94.99 7.5 15.5 3.5 1
202 98.89 24 23.5 25.5 0.5
302 93.66 12 16 13 1.5
103 99.90 33.5 28.5 49.5 16
203 96.99 29 15 34.5 1.5
303 93.19 38.5 38.5 37.5 0.5
104 89.21 41.5 37.5 28.5 2
204 84.90 29 21 27.5 1.5
304 98.13 51 40 52 1.5
105 94.5 20.5 29 14 1
205 93.7 18.5 21 18.5 2
305 98.04 10 13.5 10.5 1
106 84.84 26 33.5 2 0
206 84.68 13.5 20 12 1
306 85.79 24 28.5 2 1
107 88.2 53 44 57 1.5
207 88.22 81 64.5 76.5 6.5
307 81.40 72.5 71 79.5 5.5
108 92.37 19 18.5 9 0.5
208 94.5 14 18 1.5 0.5
308 96.17 14 17.5 3 0
109 90.50 21.5 23 26.5 0.5
209 97.53 23 27 44.5 1.5
309 95.19 29.5 21 32.5 0.5
110 93.62 16.5 11.5 35.5 1.5
210 95.88 21 25.5 36 0.5
310 88.05 25.5 13.5 24.5 1.5
111 65.31 29.5 29 2.5 1.5
211 82.77 26 27.5 10.5 0.5
311 78.59 28.5 26.5 3.5 0.5
mean 90.45 27.15 26.02 24.44 1.92
sd 7.11 16.71 13.87 21.03 2.91

Figure 1. Correlation between the scope of voicing (in %) and degree of hoarseness (in mm VAS). R² = 0.0439.

Figure 3. Correlation between the scope of voicing (in %) and degree of hyperfunction (in mm VAS). R² = 0.0334.
Discussion and Conclusions
There is generally a high degree of voicing for
all recordings and no great variation between
the different voices and recordings.
As the trendlines and the correlation coeffi-
cients show, no significant correlation (neither
positive nor negative) between the scope of
voicing and the degree of perceived voice qual-
ity can be found.
However, an interesting parallelism can be ac-
counted for concerning the trendlines in Figure
1 and Figure 2, which have a negative orientation
in both cases. It can be interpreted that the rela-
tionship between the scope of voicing and
hoarseness is similar to the relationship be-
tween scope of voicing and one of the features
of hoarseness: i.e. breathiness. This parallelism
is not found for the other features and shows that breathiness is a stronger feature than the other two for making a five-year-old child's voice sound hoarse. In an earlier study of ten-year-old children, however, hyperfunctionality was found to contribute to the perception of these children as being hoarse as much as the feature breathiness did.

Figure 2. Correlation between the scope of voicing (in %) and degree of breathiness (in mm VAS). R² = 0.0619.

Figure 4. Correlation between the scope of voicing (in %) and degree of roughness (in mm VAS). R² = 0.0137.
Recordings of the children with the highest
level of hoarseness show stable periodicity in
the production of fundamental frequency (cf.
Table 1, recording 207 and 307). For those re-
cordings high ratings of breathiness and hyper-
function were given, but rather low ratings
for the feature roughness, which is closely re-
lated to irregular periodicity in voices. When
looking closer at how these children act when
participating in the recordings, it is clear that they produce very lively speech, loud
and within a wide frequency range. It is obvi-
ous also from the more extended material that
these children are very active speakers but their
voicing does not fail despite the strong degree
of perceived hoarseness.
The child who showed a very low degree of
measured fundamental frequency (cf. Table 1,
recording 111) shows a low degree of rough-
ness, which would be the appropriate feature to
represent aperiodicity. Breathiness is graded as
similarly strong as the degree of hoarseness,
however on a moderate level. This child is not
behaving quite as lively as the children in the
recordings mentioned before and speaks in a
rather quiet and more monotonous voice, some-
times even whispers. This is the case for the
section included in the investigation material
and also in the following part of the recording,
containing spontaneous speech.
It is also noteworthy that the recording with
the broadest scope of voicing (103) shows the
highest degree of roughness among the re-
cordings. Although the value is relatively low
when compared to the ratings of the other fea-
tures, this stands in contradiction to the assumption that roughness is based on unstable voicing. This voice is also rated as moderately hyperfunctional, but has a lower rating for the overall impression of sounding hoarse. The ratings of this recording show that either the idea of calculating the scope of periodicity in the way done in this investigation, in order to account for instability in a child's voice, has to be treated with caution, or the relationship between the evaluation of a child's voice as sounding rough and the degree of irregularity has to be studied more closely.
One has to conclude that the degree of absence of measurable fundamental frequency is not related to the type or degree of voice qualities like hoarseness, breathiness, hyperfunction or roughness in five-year-old children, but rather to behavioural factors. Therefore our conclusion is that this measure does not offer an alternative to the analysis of perturbation.
For a clear picture of what acoustically makes a child's voice sound hoarse, other aspects, e.g. spectral measures, obviously have to be taken into consideration.

References
Klingholz F. and Martin F. (1985). Quantitative spectral evaluation of shimmer and jitter. J of Speech and Hearing Research 28, 169-74.
Laver, J. (1980) The phonetic description of voice quality. Cambridge University Press.
McAllister, A., Sundberg, J., Hibi, S. (1998): Acoustic measurements and perceptual evaluation of hoarseness in children's voices. Log Phon Vocol, 23, 27-38.
McAllister, A., S. Granqvist, P. Sjölander and J. Sundberg (in press) Child voice and noise: a pilot study of noise in day-cares and the effects on ten children's voice quality according to perceptual evaluation.
Sederholm E., A. McAllister, J. Sundberg and J. Dalkvist (1993) Perceptual analysis of child hoarseness using continuous scales. Scand Journal of Logopedics and Phoniatrics 18, 73-82.
Sederholm, E. (1995) Prevalence of hoarseness in ten-year-old children. Scand Journal of Logopedics and Phoniatrics 20, 165-173.
Yumoto, E., Y. Sasaki and H. Okamura (1984) Harmonic-to-noise ratio and psychophysical measurement of the degree of hoarseness. Journal of Speech and Hearing Research 27, 2-6.
On final rises and fall-rises in German and Swedish
Gilbert Ambrazaitis
Linguistics and Phonetics, Centre for Languages and Literature, Lund University

Abstract
This study explores the intonational signalling
of a request address in German and Swedish.
Data from 16 speakers (9 Germans, 7 Swedes)
were elicited under controlled conditions, and
intonation contours produced on the test phrase
Wallander? were classified according to
their phrase-final pattern. Both rises and
fall-rises were produced frequently by both
Germans and Swedes, which is in line with
Ohala's frequency code, but challenging for the
Lund model of Swedish intonation.
Introduction
The tonal system of Swedish is usually said to
differ largely from that of otherwise closely re-
lated languages such as German, Dutch, or Eng-
lish. One reason for this conception is, of
course, the presence of the tonal word accents
in Swedish, which are absent in the standard
variety of, e.g., German. But the difference be-
tween the intonational systems of German and
Swedish, as they have been described in the lit-
erature, goes far beyond the presence or ab-
sence of lexical tonal phenomena, respectively.
Table 1 displays one example each of phono-
logical accounts of Swedish and German into-
nation: the Lund model for Swedish (Bruce,
1998; 2005), and GToBI for German (Grice et
al., 2005). They have been chosen because both
are contemporary and formulated in terms of
autosegmental-metrical (AM) phonology, i.e.,
they should be formally comparable.
Table 1. Accents and final boundary tones (b.t.) in GToBI for German (Grice et al. 2005) and the Lund model for Swedish (Bruce 1998; 2005).

              Standard German                                  Standard Swedish
function      accents                          b.t.            accents        b.t.
lexical       -                                -               H+L*, H*+L     -
non-lexical   H*, L+H*, L*, L*+H, H+L*, H+!H*  L-, H-, L-%,    H-             L%, LH%
                                               L-H%, H-%, H-H%
According to Table 1, Swedish and German dif-
fer not only with respect to lexical, but also
largely with respect to non-lexical, or utterance-
related, tonal features: While German has six
different accents on the utterance-level, Swed-
ish has only one, known as the focal accent. A
similar relation holds for final boundary tones.
But the conclusion that Swedish has a much
poorer utterance prosody than German may,
of course, only be drawn under the premise that
the two models in Table 1 are (a) adequate and
(b) equivalent, in the sense that they have been
developed under equivalent conditions. How-
ever, it may be argued that Swedish and Ger-
man intonation research are characterized by
different preconditions and traditions to the ex-
tent that a formal comparison even of contem-
porary models does not reveal any reliable in-
formation on actual differences between the in-
tonational systems of the two languages.
This study is part of a larger comparative
project on Standard Swedish and Standard
German intonation, from a communicative-
functional perspective. Its general hypothesis is
that there are more similarities than indicated
by contemporary models (cf. Table 1). The gen-
eral method is to elicit certain utterance types,
or speech acts, defined by constructed (but real-
istic) discourse contexts, in both Swedish and
German, keeping the material, the situational
context, and the recording conditions as con-
stant as possible.
This paper deals with one such utterance
type, which may be labelled a request address
as exemplified by Wallander? in the follow-
ing situational context: A police officer from
Ystad (Southern Sweden) to his colleague:
Wallander? Would you mind if I asked you for
a favour? The goal of this paper is to gain a
preliminary impression of the intonation pat-
terns used by Germans and Swedes in such re-
quest addresses. For that, a classification of the
obtained intonation contours is undertaken, and
the distribution of patterns, as well as the pho-
netic form of the most frequent patterns, is
compared for Swedish and German. The classi-
fication concentrates on the phrase-final accent
pattern, or the nuclear tune in the British tra-
dition, defined as the last (in this study, the only
one present) pitch accent in an intonation
phrase plus the final boundary tone (cf. next
section).
Phrase-final intonation patterns
For German, a large variety of phrase-final in-
tonation patterns exists according to Table 1.
For the purpose of this study, however, a less
detailed classification will suffice: The nuclear
pattern is either a fall, a rise, or a fall-rise.
A fall has a high stressed or post-stress sylla-
ble and a low boundary tone (e.g., (L+)H* L-%,
L*+H L-%); a rise has a boundary tone higher
than the last accentual tone (e.g., L* H-%, H*
H-^H%); and finally, a fall-rise has a high
stressed or post-stress syllable, and a low-high
sequence as a boundary tone (e.g., H* L-H%).
For Swedish, no such three-fold contrast has
been described. The focal accent H- always in-
volves either a high stressed syllable (words
with accent I), or a high tone later in the word
(words with accent II). Combining this H- with
the two possible boundary tones in Table 1 re-
sults in a fall (H- L%), or a fall-rise (H-
LH%), respectively. That is, a rise (at least
one connected to utterance-level prominence,
cf. discussion), as defined for German above, is
not recognized by the Lund model for Swedish.
Final rises (or fall-rises) are often associated
with the notion of question intonation or with
continuation in a variety of languages. For
German, e.g., question and continuation in-
tonation seem to differ in range and shape of
the rise (Dombrowski and Niebuhr, 2005). Syn-
tactic factors have some influence on whether a
German question is falling or rising, but in gen-
eral, in accordance with Ohala's (1984) fre-
quency code, a rise signals a greater subordina-
tion of the enquirer towards the addressee
(Kohler, 2005). Rising intonation may thus
more frequently be found in connection with
polite questions. In Swedish, questions are
typically said not to be marked by final rises
(Gårding, 1979). The rising boundary tone
LH% of the Lund model has in fact hardly been
discussed from a functional perspective; one
function that has been mentioned is the signalling of continuation (Gussenhoven, 2004).
Hypothesis
According to the contemporary descriptions,
Swedish and German intonation patterns should
be expected to differ in the expression of a re-
quest address. Considering a request address
as some kind of polite question, or at least a
function connected with a subordination of the
speaker towards the addressee, one would ex-
pect a rise, or possibly a fall-rise for German.
For Swedish, on the other hand, a fall and a
fall-rise are the only patterns offered by the
Lund model, where the fall-rise is not associ-
ated with question intonation.
Method and materials
German and Swedish subjects were asked to
read test utterances from a computer screen in
an experimental studio at the Humanities Labo-
ratory at Lund University. All utterances consti-
tuted parts of constructed dialogues. For each
test item, a short text describing a situational
context was displayed on the screen, followed
by the test utterance. The speakers were asked
to render the test utterance as natural as possi-
ble. Five repetitions of each item were re-
corded. There were 13 test items in total, and
the whole list of 65 items was randomized. So
far, 7 speakers of Standard Swedish (4 female),
and 9 speakers of Standard German (6 female)
have been recorded.
The test material of this study consists of
one of the 13 items, the one-word phrase Wal-
lander?, both for German and for Swedish. It
constituted the first part of the test utterances
"Wallander? Skulle jag kunna få be dig om en tjänst?" (Swedish), and "Wallander? Dürfte ich Sie um einen Gefallen bitten?" (German).
The database of this study consists of all 5 repe-
titions by all 16 speakers, hence 80 renderings
of Wallander?, 35 by Swedish, and 45 by
German speakers.
As a first step in data analysis, all intonation
contours were categorized according to their
phrase-final patterns as described above. The
classification was done manually (by inspect-
ing the F0 contours, auditorily and visually) by
the author. In a second step, differences be-
tween the German and Swedish realizations of
the categories obtained in step 1 were looked
for. For that, each token of Wallander? was
segmented into 5 units corresponding to /(v)a/,
/l/, /a/, /nd/, /de(r)/. The segmentation was done
manually with the help of a spectrogram. All
segments were fully voiced; initial fricatives (as
possible realization of /v/) were, if present, ex-
cluded from the initial segment. The boundary
between /nd/ and /de(r)/ was set immediately
before the plosive burst of /d/. For the purpose
of visual comparison, F0 contours were time-
normalized, by representing each of the 5 seg-
ments by 10 equidistant F0 measurements.
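A minimal sketch of this normalization step, assuming the six segment boundaries are available from the manual segmentation (the function and variable names are hypothetical, not the author's script; F0 is expressed in semitones relative to 100 Hz, as in Figure 1):

import numpy as np

def time_normalize(times, f0_hz, boundaries, points_per_segment=10):
    # times, f0_hz: a sampled F0 track (voiced frames only);
    # boundaries: the six segment edges (in seconds) for the 5 segments.
    semitones = 12 * np.log2(np.asarray(f0_hz, dtype=float) / 100.0)   # 0 st = 100 Hz
    contour = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        sample_times = np.linspace(start, end, points_per_segment)     # 10 points per segment
        contour.extend(np.interp(sample_times, times, semitones))
    return np.array(contour)                                           # 50 values per token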
Table 2. Distribution of nuclear intonation patterns by Swedish and German speakers. N = absolute number of items; Speak. = speakers who (at least once) produced a pattern; the first letter in a speaker label indicates sex (M = male; F = female).

            German                                            Swedish
            %      N    Speak.                                %      N    Speak.
Fall        13.3   6    Mmk; Mas                              17.1   6    Fss; Mnh
Fall-rise   17.8   8    Fjd; Fll; Fcf                         42.9   15   Fkb; Fcw; Mmr
Rise        66.7   30   Fib; Fjd; Fkm; Fmt; Fcf; Mms; Mas     25.7   9    Mmu; Mnh
Unclear     2.2    1    Fll                                   2.9    1    Fek
Other       0.0    0    -                                     11.4   4    Fek
Sum         100    45   9                                     100    35   7



[Figure 1 plot: mean F0 contours in semitones over normalized time; curves shown: German FR (N=8), Swedish FR (N=15), German R (N=30), Swedish R (N=9).]
Figure 1. Average F0 contours in semitones (0
semitones set to 100 Hz) of rises (R) and fall-rises
(FR) by German and Swedish speakers. Time is
normalized (10 data points per segment); the
breaks in the curves indicate segment boundaries;
vertical lines mark the vowel of the stressed sylla-
ble. Observe that the curves are based on produc-
tions from speakers of different sex: German FR
(female); German R (female and male); Swedish
FR (female and male); Swedish R (male).
Results
In most cases, a classification as either fall, rise,
or fall-rise was unproblematic, since these con-
tours were produced rather prototypically.
There were only two cases (one for each lan-
guage), where a decision between fall or fall-
rise was problematic (there was a slight rise of
less than 1 semitone). Four patterns (all by the
same female Swedish speaker Fek) were classi-
fied as other: Two were actually falling, but
lacked the typical rising focal accent H-, and
two cases exhibited a high-level monotone
throughout the word.
Distribution of patterns
Table 2 displays the distribution of the patterns
obtained for Swedish and German speakers. It
does, however, not include exact information
on how the N occurrences of a particular pat-
tern are distributed over the speakers listed un-
der Speak.. Most speakers in fact produced
the same pattern type in all of their 5 repeti-
tions. Only 4 of the German (Mas, Fjd, Fll, Fcf)
and 2 of the Swedish speakers (Mnh, Fek) oc-
casionally produced different pattern types.
Table 2 shows that each of the three nuclear
intonation patterns (fall, rise, fall-rise) was pro-
duced by at least two speakers of each lan-
guage. However, the German speakers most
frequently chose a (simple) rising pattern, while
the Swedes seemed to prefer a fall-rise. But
note that the fall-rise was actually produced by
only 3 of the 7 Swedish speakers, while the
(simple) rise is distributed over 7 of 9 German
speakers. In order to test for an interaction of
language and preference for either a rise or a
fall-rise, the data were re-arranged as follows: If
a speaker had produced pattern X in at least 3
(of 5) repetitions, s/he was classified as an X
speaker. This arrangement is shown in Table 3.
Table 3. Number of German and Swedish speakers
who preferred either a rise or a fall-rise.
German Swedish sum
Fall-rise speakers 2 3 5
Rise speakers 6 2 8
Sum 8 5 13
Fisher's exact test, however, revealed that the
interaction between language and preference for
either a rise or a fall-rise, which is slightly indi-
cated by the data, is not significant (p=.2494).
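A sketch of this test on the counts in Table 3 is given below; the use of scipy is an assumption, since the paper does not state which software was used. For this table the one-sided hypergeometric tail is 321/1287 = .2494, which matches the reported value, while the conventional two-sided p is somewhat larger and equally non-significant.

from scipy.stats import fisher_exact

table = [[2, 3],   # fall-rise speakers: German, Swedish
         [6, 2]]   # rise speakers:      German, Swedish

_, p_one_sided = fisher_exact(table, alternative='less')   # tail with fewer German fall-rise speakers
_, p_two_sided = fisher_exact(table)                        # default two-sided test
print(round(p_one_sided, 4), round(p_two_sided, 4))         # approx. 0.2494 and 0.29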
Contour shape of rises and fall-rises
Figure 1 displays the mean F0 contours of the
rises and the fall-rises in semitones as produced
by the relevant German and Swedish speakers
(cf. Table 2). Since F0 was not speaker-
normalised and the curves emerge from speak-
ers of different sex, the absolute height of F0
values should be ignored. However, F0 move-
ments, or relative heights, may be compared,
since the F0 measure used is logarithmic.
Both for the rise and for the fall-rise there
appears to be at least one salient difference be-
tween the Swedish and German productions:
As for fall-rises, the final rise spans about 5
semitones for both Germans and Swedes, but
the relative height of the end point differs cru-
cially: it is higher than the accentual F0 peak
for German, but lower than the accentual (fo-
cal) peak for Swedish. Furthermore, the accent
peak appears to be timed somewhat later in the
German compared to the Swedish data. As for
the rises, there is a pronounced tonal step down
from the pre-stress to the low stressed syllable
in the Swedish productions, which is very small
in the German data. There was only little varia-
tion among the repetitions within each category
and language regarding these characteristics.
Discussion
In this study, at least two different pattern types
(a rise and a fall-rise) resulted from the elicitation of a function labelled 'request address'.
Whether these two types represent two equiva-
lent strategies for expressing the same function,
or whether they in fact express different func-
tional nuances that have not been controlled,
will have to be tested in future research.
However, the results indicate that some
form of rise (rise or fall-rise) is the most
frequently occurring final pattern in connection
with a request address, both in German and in
Swedish. This is in line with the frequency code
(Ohala, 1984), which associates high or rising
pitch with the expression of subordination, in
contrast to low or falling pitch, signalling domi-
nance. However, the result challenges the Lund
model, since the rising boundary tone (LH%)
offered by the model has so far not been associ-
ated with the signalling of subordination.
The most salient difference found between
the German and the Swedish data concerns the
step from the pre-stress to the stressed syllable
in the rise patterns, which is very pronounced
in the Swedish, but very small, if not absent, in
the German data. This is in line with the Lund
model, which predicts such an early fall for
accent I when a rising focal accent (H-) is miss-
ing. In fact, the only possibility of the Lund
model to deal with these rises is to describe
them as a non-focal accent I plus rising bound-
ary tone (H+L* LH%).
However, utterances of the type discussed
here (request address) have traditionally not
been within the scope of the Lund model,
which is actually based on statements only.
These may be realized by several prosodic
phrases, and the Lund model assumes that each
of such phrases contains at least one focal ac-
cent (actually, the term phrase accent, as used
earlier, would be more appropriate, since a
statement consisting of two phrases with one
phrase accent each could still have only one
word in narrow focus).
Thus, the Lund model analysis H+L* LH%
for the rise is problematic, since it renders a
phrase lacking any phrase accent. In a com-
parison with other Germanic languages, the
original assumption by the Lund model is plau-
sible, since at least one word in a (non-
interrupted) phrase in, e.g., German and Eng-
lish, is always conceived of as accented (nec-
essarily on the phrase/utterance level). It has
been argued that Swedish, like German, has an
early falling accent on the utterance-level as
well, which is used in the expression of con-
firmation (Ambrazaitis, 2007). The present
Swedish data could also be analyzed in the light
of this earlier finding, i.e., as instances of a low
utterance-level accent, which exists besides the
classical rising focal accent H-.
References
Ambrazaitis G. (2007) Expressing 'confirmation' in Swedish: the interplay of word and utterance prosody. Proc. 16th ICPhS (Saarbrücken, Germany), 1093-1096.
Bruce G. (1998) Allmän och svensk prosodi. Lund: Institutionen för Lingvistik.
Bruce G. (2005) Intonational prominence in varieties of Swedish revisited. In Jun, S.-A. (ed) Prosodic Typology: The Phonology of Intonation and Phrasing, 410-429. Oxford: OUP.
Dombrowski E. and Niebuhr O. (2005) Acoustic patterns and communicative functions of phrase-final F0 rises in German: Activating and restricting contours. Phonetica 62, 176-195.
Grice M., Baumann S., and Benzmüller R. (2005) German intonation in autosegmental-metrical phonology. In Jun, S.-A. (ed) Prosodic Typology: The Phonology of Intonation and Phrasing, 55-83. Oxford: OUP.
Gussenhoven C. (2004) The Phonology of Tone and Intonation. Cambridge: CUP.
Gårding E. (1979) Sentence intonation in Swedish. Phonetica 36, 207-215.
Kohler K. (2005) Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. AIPUK (Arbeitsberichte IPdS Kiel) 35a, 125-142.
Ohala J.J. (1984) An ethological perspective on common cross-language utilization of F0 of voice. Phonetica 41, 1-16.
SWING:
A tool for modelling intonational varieties of Swedish
Jonas Beskow², Gösta Bruce¹, Laura Enflo², Björn Granström², Susanne Schötz¹ (alphabetical order)
¹ Dept. of Linguistics & Phonetics, Centre for Languages & Literature, Lund University
² Dept. of Speech, Music & Hearing, School of Computer Science & Communication, KTH
Abstract
SWING (SWedish INtonation Generator) is a new
tool for analysis and modelling of Swedish in-
tonation by resynthesis. It was developed in or-
der to facilitate analysis of regional varieties,
particularly related to the Swedish prosody
model. Annotated speech samples are resynthe-
sized with rule based intonation and audio-
visually analysed with regard to the major in-
tonational varieties of Swedish. We find the tool
useful in our work with testing and further de-
veloping the Swedish prosody model.
Introduction and background
Our object of study in the research project
SIMULEKT (Simulating Intonational Varieties
of Swedish, supported by the Swedish Research
Council) (Bruce et al., 2007) is the prosodic
variation characteristic of different regions of
the Swedish-speaking area, shown in Figure 1.

Figure 1. Approximate geographical distribution of
the seven main regional varieties of Swedish.
The seven regions correspond to our present
dialect classification scheme. In our work, the Swedish prosody model (Bruce & Gårding, 1978; Bruce & Granström, 1993; Bruce, 2007)
and different forms of speech synthesis play
prominent roles. Our main sources for analysis
are the two Swedish speech databases Speech-
Dat (Elenius, 1999) and SweDia 2000 (Eng-
strand et.al, 1997). SpeechDat contains speech
recorded over the telephone from 5000 speak-
ers, registered by age, gender, current location
and self-labelled dialect type. The research pro-
ject SweDia 2000 collected a word list, an elic-
ited prosody material and extensive spontane-
ous monologues from 12 speakers (younger and
elderly men and women) each from more than
100 different places in Sweden and Swedish-
speaking parts of Finland, selected for dialectal
speech.
The Swedish prosody model
The main parameters for the Swedish prosody
model are for word prosody 1) word accent
timing, i.e. timing characteristics of pitch ges-
tures of word accents (accent I/accent II) rela-
tive to a stressed syllable, and 2) pitch patterns
of compounds, and for utterance prosody 3) in-
tonational prominence levels (focal/non-focal
accentuation), and 4) patterns of concatenation
between pitch gestures of prominent words.
Background
An important part of our project work concerns
auditive and acoustic analysis of dialectal
speech samples available from our two exten-
sive speech databases described in Section 1.
This work includes collecting empirical evi-
dence of prosodic patterns for the regional va-
rieties of Swedish described in the Swedish
prosody model, as well as identifying intona-
tional patterns not yet included in the model.
To facilitate our work with testing and further
developing the model, we needed a tool for
generating rule-based intonation.

Figure 2. Schematic overview of the SWING tool.

Figure 3. Example of an annotated input speech sample.
Design
SWING comprises a number of parts, which are
joined by the speech analysis software Praat
(Boersma & Weenink, 2007), also serving as the
graphical interface. Annotated speech samples
and rules for generating intonation are used as
input to the tool. The tool generates and plays
resynthesis with rule-based and speaker-
normalised intonation of the input speech
sample. Additional features include visual dis-
play of the output on the screen, and options
for printing various kinds of information to the
Praat console (Info window), e.g. rule names
and values, or the time and F0 of generated
pitch points. Figure 2 shows a schematic view
of the tool design.
Speech material
The input speech samples are annotated manu-
ally. Stressed syllables are labelled prosodically
and the corresponding vowels are transcribed
orthographically. Table 1 shows the prosodic
labels used in the current version of the tool,
while Figure 3 displays an example utterance
with prosodic annotation: "De på kvällarna som vi sänder" ('It's in the evenings that we are transmitting').
Table 1. Labels used for prosodic annotation of
the speech samples to be analysed by the tool.
Label Description
pa1 primary stressed (non-focal) accent 1
pa2 primary stressed (non-focal) accent 2
pa1f focal accent 1
pa2f focal accent 2
Rules
The Swedish prosody model is implemented as
a set of rule files, one for each regional variety in the model, with timing and F0 values for
critical points in the rules. These files are sim-
ply text files with a number of columns, where
the first one contains the rule names, and the
following columns contain three pairs of values,
corresponding to the timing and F0 of equally
many critical pitch points of the rules. The
three points are called ini (initial), mid (medial),
and fin (final). They contain values for the tim-
ing (T) and F0 (F0). Timing of F0 points is ex-
pressed as a percentage into the stressed sylla-
ble, starting from the onset of the stressed
vowel. If no timing value is explicitly stated in
the rule, the pitch point is by default aligned
with the onset of the stressed vowel. Three
values are used for F0: L (low), H (high) and H+
(extra high, used in focal accents). The mid
pitch point is optional; unless it is needed by a
rule, its values may be left blank. Existing rules
are easy to adjust, and new rules can be added.
Table 2 shows an example of the rules for
South Swedish. Several rules contain a second
part, which is used for the pitch contour of the
following (unstressed) interval (segment) in the
annotated input speech sample. This extra part
has 'next' attached to its rule name. Examples
of such rules are pa1f and pa2f in Table 2.
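As an illustration of the rule file format described above, the sketch below reads one such file into a dictionary. It assumes tab-separated columns and blank fields for unused values; this is not the actual SWING implementation (a Praat script), only an interpretation of the textual description:

def read_rules(path):
    # One rule per line: a rule name followed by up to six tab-separated fields
    # (iniT, iniF0, midT, midF0, finT, finF0); unused fields are left blank.
    rules = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if not fields[0].strip():
                continue
            values = [v.strip() or None for v in fields[1:7]]
            values += [None] * (6 - len(values))              # pad missing trailing columns
            ini_t, ini_f0, mid_t, mid_f0, fin_t, fin_f0 = values
            rules[fields[0].strip()] = {"ini": (ini_t, ini_f0),
                                        "mid": (mid_t, mid_f0),
                                        "fin": (fin_t, fin_f0)}
    return rules

# e.g. rules["pa1f"] could come out as {"ini": ("-10", "L"), "mid": ("20", "H+"), "fin": ("50", "L")}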
The SWING tool procedure
Analysis with the SWING tool is fairly straight-
forward. The user selects one input speech
sample and one rule file to be used with the
tool, and which (if any) information about the
analysis (rules, pitch points, debugging infor-
mation) to be printed to the console. A Praat
script generates resynthesis of the input speech
sample with a rule based output pitch contour.
Generation of the output pitch contour is based
on 1) the pitch range of the input speech sam-
ple, which is used for speaker normalisation, 2)
the annotation, which is used to find the time
and prosodic gesture to generate, and 3) the rule
file, which is used for the values of the pitch
points in the output. The Praat graphical user
interface provides immediate audio-visual feed-
back of how well the rules work, and also al-
lows for easy additional manipulation of pitch
points with the Praat built-in Manipulation fea-
ture. Figure 4 shows a Praat Manipulation ob-
ject for an example utterance. The light grey line
under the waveform shows the original pitch,
while the circles connected with the solid line
represent the rule-generated output pitch con-
tour. In the Praat interface, the user can easily
compare the original and the resynthesized
sounds and pitch contours, and further adjust
or manipulate the output pitch contour (by
moving one or several pitch points) and the an-
notation files. The rule files can be adjusted in
any text editor.
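The following sketch illustrates how a single rule could be turned into concrete pitch points along the lines described above. The mapping of L/H/H+ onto the speaker's measured pitch range, the scaling of the 'extra high' level, and the handling of the default timing are assumptions, not the actual SWING procedure:

def pitch_points(rule, vowel_onset, syllable_duration, f0_min, f0_max):
    # Map the symbolic F0 levels onto the speaker's measured pitch range
    # (the scaling of H+ above that range is an arbitrary assumption).
    level = {"L": f0_min, "H": f0_max, "H+": 1.1 * f0_max}
    points = []
    for name in ("ini", "mid", "fin"):
        timing, f0_label = rule[name]
        if f0_label is None:            # the mid point is optional
            continue
        percent = float(timing) if timing is not None else 0.0
        t = vowel_onset + (percent / 100.0) * syllable_duration
        points.append((t, level[f0_label]))
    return points

# Hypothetical usage with the pa1f rule from Table 2:
# pitch_points(rules["pa1f"], vowel_onset=0.43, syllable_duration=0.25,
#              f0_min=180.0, f0_max=280.0)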
Testing the Swedish prosody
model with SWING
SWING is now being used in our work with test-
ing and developing the Swedish prosody model.
Testing is done by selecting an input sound
sample and a rule file of the same intonational
variety. If the model works adequately, there
should be a close match between the F0 contour
of the original version and the rule-based one
generated by the tool. Figure 5 shows examples
of such tests of an utterance in the Svea and
South Swedish varieties. Interesting pitch pat-
terns found in our material which have not yet
been implemented in the rules are also analysed
using the tool.
Table 2. Example rule file for South Swedish with timing (T) and F0 (F0) values for initial (ini), mid (mid) and final (fin) points.
Rule name                    iniT  iniF0  midT  midF0  finT  finF0
global (phrase) L L
concatenation L L
pa1f (focal accent 1) -10 L 20 H+ 50 L
pa1f next (extra gesture) L L
pa2f (focal accent 2) L 40 L H+
pa2f next (extra gesture) H+ 30 L L
pa1 (non-focal accent 1) -30 L 10 H 40 L
pa2 (non-focal accent 2) L 50 L H
pa2 next (extra gesture) H 30 L L

Figure 4. Praat Manipulation display of a South Swedish utterance with rule-generated Svea intonation
(circles connected by solid line; original pitch: light-grey line).


Discussion and future work
Although SWING still needs work, we already
find it useful in our project work of analysing
speech material as well as testing our model.
We consider the general results of our model
tests to be quite encouraging. The tool has so
far been used on a limited number of words,
phrases and utterances and with a subset of the
parameters of the Swedish prosody model, but
was designed to be easily adapted to further
changes and additions in rules as well as speech
material. We are currently including more
speech samples from our two databases, and
implementing other parameters of the Swedish
prosody model, such as rules for compound
words. Our near future plans include evaluation
of the tool by means of perception tests with
natural as well as rule-generated stimuli.





References
Boersma P. and Weenink D. (2007) Praat: doing phonetics by computer (version 4.6.17) [computer program]. http://www.praat.org/, visited 12-Mar-08.
Bruce G. (2007) Components of a prosodic typology of Swedish intonation. In Riad T. and Gussenhoven C. (eds) Tones and Tunes, Volume 1, 113-146, Berlin: Mouton de Gruyter.
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody, 219-228, Lund: Department of Linguistics.
Bruce G. and Granström B. (1993) Prosodic modelling in Swedish speech synthesis. Speech Communication 13, 63-73.
Bruce G., Granström B., and Schötz S. (2007) Simulating intonational varieties of Swedish. Proc. of ICPhS XVI (Saarbrücken, Germany).
Engstrand O., Bannert R., Bruce G., Elert C-C., and Eriksson A. (1997) Phonetics and phonology of Swedish dialects around the year 2000: a research plan. Papers from FONETIK 98, PHONUM 4, Umeå: Department of Philosophy and Linguistics, 97-100.
Elenius K. (1999) Two Swedish SpeechDat databases - some experiences and results. Proc. of Eurospeech 99, 2243-2246.


Figure 5. Original and rule-based intonation of the utterance "De på kvällarna som vi sänder" ('It's in the evenings that we are transmitting') for Svea and South Swedish (original pitch: dotted line;
rule-generated pitch: circles connected with solid line).
Recognizing phrase and utterance as prosodic units in
non-tonal dialects of Kammu
Anastasia Karlsson¹, David House² and Damrong Tayanin¹
¹ Department of Linguistics and Phonetics, Lund University
² Speech, Music and Hearing, KTH, Stockholm
Abstract
This paper presents a study of prosodic phras-
ing in a non-tonal dialect of Kammu, a Mon-
Khmer language spoken in Northern Laos.
Prosodic phrasing is seen as correlated with
syntactic and informational structures, and the
description is made referring to these two lev-
els. The material investigated comprises sen-
tences of different lengths and syntactic struc-
tures, uttered by seven male speakers. It is
found that, based on prosodic cues, the distinc-
tion between prosodic utterance and prosodic
phrase can be made. Prosodic phrase is sig-
naled by a sequence of low + high pitch while
the right edge of the prosodic utterance gets
low pitch. This low terminal is replaced by a
high terminal in expressive speech. The study is
performed using elicited speech.
Introduction
Kammu, a Mon-Khmer language, has dialects
with lexical tones (low and high lexical tone)
and dialects with no lexical tones. Tones arose
at a late stage of the language's development in
connection with loss of the contrast between
voiced and voiceless initial consonants in a
number of dialects (Svantesson and House,
2006). There are no other phonological differ-
ences between toneless and tonal dialects and
this makes the different Kammu dialects well
suited for studying and comparing the use of
phrase intonation.
In this paper we present an investigation of
prosodic phrasing in the non-tonal dialects of
Kammu. We concentrate on the use of pitch in
signaling prosodic grouping. It is assumed that
a spoken utterance can be prosodically signaled
as one prosodic unit or divided into smaller
prosodic units. We do not have any pre-
assumptions about the types and numbers of
prosodic units in Kammu. Instead we assume
that, due to the elicited type of the material, the
utterances are read as autonomous utterances,
and it is interesting to find out whether the
rightmost utterance boundary is signaled
prosodically differently from the utterance in-
ternal boundaries. Due to the SVO word order
Kammu typically places new information at the
right edge of the utterance.
Words in Kammu are monosyllabic or ses-
quisyllabic (Svantesson and Karlsson, 2004).
Sesquisyllabic words consist of one minor and
one major syllable. The minor syllable has
schwa as its nucleus. Schwa insertion is often
absent in casual speech, but appears consis-
tently in some types of singing (Lundström and
Svantesson, 2008). There is also a phonological
distinction between short and long vowels.
Method
Material
Material was collected in Laos in 2007. For the
present investigation 16 sentences with differ-
ent lengths and different syntactic structures
were chosen. Kammu lacks a written script and
informants were asked to translate the material
from Lao to Kammu. Kammu speakers are bi-
lingual with Lao being their second language.
This resulted in somewhat different but still
compatible versions of the target sentences. The
resulting utterances were checked and tran-
scribed by Damrong Tayanin who is a native
speaker of Kammu.
Each target sentence was read three times by
the speakers. A total of 226 utterances were in-
vestigated.
Subjects
A total of nine speakers, seven men and two
women were recorded. Their ages ranged from
14 to 57 years. For the present investigation
seven speakers, all men, were chosen. They are
labeled as S1, S3, S5, S6, S7, S8 and S9.
Recording and analysis
The subjects were recorded with a portable Edi-
rol R-09 digital recorder. Five of the speakers
were recorded in quiet hotel rooms, S3 was re-
corded in his native village, and S7 was re-
corded at his home.
The recordings were analyzed using the
Praat program. For each utterance, an f0 contour was extracted. Main pitch features such as turning points, lows and highs, relations between lows and highs, specified by finding the lowest and the highest points in the f0 contours, and shapes of pitch gestures (fall, rise or level tone) were measured.
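A small sketch of how such features could be read off a Praat-style f0 track (an assumed implementation, not the authors' procedure; the threshold for a level tone is arbitrary):

import numpy as np

def pitch_features(times, f0_hz, level_threshold=10.0):
    # Lowest and highest points of the f0 contour and a rough gesture label.
    times, f0 = np.asarray(times), np.asarray(f0_hz)
    voiced = f0 > 0                              # 0 marks unvoiced frames
    t, f = times[voiced], f0[voiced]
    i_low, i_high = int(np.argmin(f)), int(np.argmax(f))
    if f[i_high] - f[i_low] < level_threshold:   # arbitrary threshold in Hz
        gesture = "level"
    else:
        gesture = "rise" if i_high > i_low else "fall"
    return {"low": (t[i_low], f[i_low]),
            "high": (t[i_high], f[i_high]),
            "gesture": gesture}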
The observed tonal features were analyzed
by referring to the syntactic and information
structure of the utterances. Thus, the division of
the sentences into syntactic phrases (NP, VP
etc) and types of words (function- or lexical
words) were matched to the tonal events ob-
served. The placement of pragmatic focus was
unambiguous from the sentence contents.
Results
Prosodic utterance
Regarding the signaling of the right utterance
edge, the speakers can be divided into two
groups. Speakers in the first group (S1, S5, S7,
S9) have high pitch gestures utterance finally.
Speakers in the second group (S3, S6, S8) have
both low and high terminals. In the second
group, S3 and S6 have a prevalence for low
terminals and S8 has mostly high terminals.
In our previous investigation (Karlsson et
al., 2007), we found two main types of focal
accent in the non-tonal dialects. The material
comprised recordings made by Kristina Lindell
in the 1970s of one male speaker telling a folk-
tale. Based on the contents, we assumed that the
high focal accent was more expressive than the
low one. The present material gives more sup-
port to our assumption about the different prag-
matic load of the focal accents. Thus, we find
two different gestures, a falling pitch and a high
pitch in the same position in the utterance. This
is a default position for focus and both gestures
thus function as focal accent. The high focal
accent differs from the low one by its expres-
sive load. For instance, S3 is reserved when
reading the written material. He uses high ter-
minals only in some cases. On the other hand,
in his spontaneous storytelling, S3 is very re-
laxed and uses high terminals instead of low
ones.
A default pattern of a neutral utterance ut-
tered as one prosodic unit is a declining f0
course with a low (falling) terminal. The low
terminal tone signals the right boundary of a
prosodic utterance. A prosodic phrase can not
have a low terminal (more on this in the next
section), unless the phrase can (syntactically and
semantically) function as an autonomous utter-
ance.
Speakers with high terminals usually reach
the highest f0 value in the utterance at the final
rise. This is typical even for utterances with
multiple focal accents, such as listings of ob-
jects that a person owns. Figure 1 illustrates
variants of the sentence hmra o d wt, traak o d wt 'A horse I'll buy and a buffalo I'll buy'. In the top panel the f0 course of S3 is presented. The sentence is realised as two prosodic utterances with low terminals. The utterance boundaries are shown with arrows. In the bottom panel, the f0 course of S1 is presented.
This speaker is expressive and has high bound-
ary tones. He uttered the sentences as consisting
of four prosodic phrases, each ending with a
high tonal gesture (shown with arrows).



Figure 1. F0 courses of the sentence (its mutual variants) hmra o d wt, traak o d wt 'A horse I'll buy and a buffalo I'll buy' uttered by S3 (the upper panel) and S1 (the bottom panel).
Prosodic phrase
When the utterance is divided into smaller pro-
sodic units (named prosodic phrases here), it is
signaled by a combination of low + high pitch.
The basic prosodic phrase comprises two words
with the first word getting low pitch and the
second word getting high pitch. These two-
word groups overlap with the syntactic group-
ing in the sense that prosodic grouping does not
pose boundaries which are syntactically impos-
sible. The prosodic phrasing pattern low + high
can be supposed to reflect the placement of the
focal accent at the right edge of a focused unit.
This is, however, a subject for future study. The
right boundary of a prosodic phrase is signaled
by a high pitch, and the right boundary of a pro-
sodic utterance gets the highest f0 value in expressive speech. In Figure 2, f0 courses are shown of the utterance , o phaan hmra 'Yes, I'll kill the horse' (top panel) and of the utterance o phaan hmra too mee wt knaay 'I'll kill the horse that you bought' (bottom panel), by Speaker 9 (an expressive one). Both utterances are divided into three prosodic phrases; their final high tonal gestures are shown with arrows. The word 'horse' is utterance final in the top panel and gets the highest f0 values. In the bottom panel the same word is in utterance medial position and does not get the highest f0 values; it is the utterance final word that gets in most cases the highest f0 value. The high pitch on 'horse' in the first case is the utterance-final high, while in the second case it is a phrase-final high.
Figure 2. f0 courses of the utterances , o phaan hmra 'Yes, I'll kill the horse' (top panel) and of the utterance o phaan hmra too mee wt knaay 'I'll kill the horse that you bought', S9.

The basic phrasal pattern low + high can be
modified due to a number of factors. Thus,
when more than two words are included in a
prosodic phrase, the non-final words get low
pitch and the last one is marked with high pitch.
One-word phrases occur. They get only high
pitch, and there are no prosodic phrases with
only low pitch. One-word phrases are typically
the utterance-initial word, thus the pronominal
o 'I' is often phrased as a one-word phrase.
Adverbials, placed utterance initially and syn-
tactically being self-sufficient units, are always
marked by a high pitch and phrased as a one-
word phrase. Figure 3 illustrates the sentence
sgii g waar Today it is sunny uttered as sgii
waar by S1. Both words get high pitch and the
utterance is by virtue of this divided into two
prosodic phrases.

Figure 3. F0 course of the utterance sgii waar 'Today it is sunny' uttered by S1.
Clusters of two high tones within the same prosodic phrase occur, but only in utterance-final position. This happens when the last two words are pragmatically highlighted. The pattern low + high is the preferred one, and no more than two high tones are found in the same prosodic phrase.
Final word
In elicited Kammu speech the division into smaller and larger prosodic units, the prosodic phrase and the prosodic utterance, is clearly signaled by prosodic means. The next step is to test this description against spontaneous speech. In a previous study (Karlsson et al., 2007) we could not identify any prosodic cues to discriminate between the prosodic phrase and the prosodic utterance and chose to operate only with the prosodic phrase. Given that a prosodic phrase cannot have a low tone finally, the previous material can now be reanalyzed.
Acknowledgements
This work has been carried out within the research project 'Separating intonation from tone' (SIFT), supported by The Bank of Sweden Tercentenary Foundation.
References
Karlsson A.M., House D., Svantesson J-O. and
Tayanin D. (2007). Prosodic phrasing in to-
nal and non-tonal dialects of Kammu. Pro-
ceedings of the 16th International congress
of phonetic sciences, 1309-1312.
Lundström, Håkan & Svantesson, Jan-Olof. (2008). Hrl singing and word-tones in Kammu. Working Papers 53: 117-131. Lund University: Dept. of Linguistics and Phonetics.
Svantesson, J-O., & House, D. (2006). Tone
production, tone perception and Kammu
tonogenesis. Phonology, 23, 309-333.
Svantesson, J-O & Karlsson, A.M. (2004). Mi-
nor Syllable Tones in Kammu. Proceedings
of International Symposium on Tonal As-
pects of Languages, Beijing 28-30 March,
177-180.
Phonological association of tone. Phonetic implications in West Swedish and East Norwegian
Tomas Riad¹ and My Segerup²
¹Department of Scandinavian languages, Stockholm University
²Centre for languages and literature, Lund University
Abstract
This paper looks for preliminary phonetic evidence in support of a phonological difference between tonal structure in West Swedish (Göteborg) and East Norwegian (Oslo) compounds that occasions a distributional difference between tonal accents in the two dialects (Riad 1998, 2006). We propose that the chief difference concerns the phonological association of the so-called prominence L tone in the HLH% sequence. We found a tendency for the L minimum to occur further to the left in East Norwegian than in West Swedish, in accordance with the prediction.
Introduction
The intonational affinity between West Swedish and East Norwegian is clear (Gårding & Stenberg 1990). The realization of tone accent is melodically very similar. In citation form an accent 2 word contains an HLH sequence, where the first H (lexical or postlexical) tone occurs on the main stressed syllable and the second H (boundary) tone is aligned with the right edge of the word. The second H marks the boundary of the intonation phrase, and also carries some of the focus function (Fretheim & Nilsen 1989). On the face of it, one might think that the two dialects are quite similar, and indeed they are categorized together in Gårding's Scandinavian Accent Typology (1977). In fact, Gårding and Stenberg (1990) found that the two dialects are quite similar, apart from the pitch height in conjunction with the boundary H, which is markedly higher in East Norwegian than in West Swedish.
Nevertheless, there are a couple of facts that signal a bigger difference than meets the ear, and the point of this pilot study is to phonetically follow up on the proposal made for phonology in Riad (1998, 2003, 2006), and to prepare for a more thorough investigation.
Background
The chief reason to suspect that there is a grammatical difference in tonal structure between the two dialect areas is the distribution of tone accent in compounds and other structures containing more than one phonological stress. In Swedish dialects, except South Swedish, and in Norwegian dialects north of Trøndelag, pretty much all regular compounds get accent 2. In these cases, accent 2 is assigned postlexically, by virtue of the prosodic constellation of two stresses within the structure. In the western and southern Norwegian dialects and in South Swedish, both accents occur in compounds and compound-like structures. In these cases, the assignment of tone accent is influenced by a combination of lexical, morphological and prosodic factors (cf. Withgott & Halvorsen 1988, Kristoffersen 1992 for Norwegian; cf. Bruce 1974, Delsing & Holm 1988, Riad 1998, 2003, 2006 for Swedish). This difference occasions an isogloss that cuts through the Scandinavian peninsula dividing dialects into either type. East Norwegian and West Swedish are on either side of it (for maps, cf. Riad 2003:125, 2005:23, 2006:40).¹
Hypotheses from phonology
In Riad (1998) it is proposed that the isogloss
marks a difference in the tonal alignment of the
prominence tone in compounds and com-
pound-like structures. We use the term promi-
nence tone in a function-neutral way here. It
simply denotes the tone that follows the lexi-
cal/postlexical tone of accent 2. In the dialects
at hand, it is the L tone between the two H
tones. According to the original proposal (the
alignment hypothesis), the difference lies in
left-alignment of L in East Norwegian and left-
and-right-alignment of L in West Swedish.
Thus, in East Norwegian there is interpolation
from L to the final H boundary tone, whereas in
West Swedish the L tone spreads between the
H tones, occasioning a tonal floor.
In this article, we adopt an alternative hypothesis (assumed in Riad 2008) where the claim is that the difference between the types is really one of association (the association hypothesis). The phonological claim, then, is that general accent 2 in compounds and similar structures follows from tonal association to both the initial primary stress and the rightmost secondary stress. This is what we find in the eastern-northern area of Scandinavia. Conversely, in dialects that allow either tonal accent in compounds, there is only one association point, namely the initial primary stress. This is what we find in the western-southern area. The prediction for these latter dialects, then, is that they instantiate tone accent in much the same way in both simplex forms and compounds.²
The representational difference between the two dialects is illustrated in Figure 1.

Göteborg
H   L        H%
|   |
sommar-ledig-heten        sommar-ledig-heten

Oslo
H   L   H%
|
sommar-ledig-heten        sommar-ledig-heten

Figure 1. Schematic representations of the compound sommarledigheten 'the summer vacation' in Göteborg³ and Oslo. The contour is accent 2. The prominence tone is underscored (L). The stylized contours to the right give an idea of how these representations are expected to come out melodically.
Predictions and expectations
As far as this pilot study is concerned, the chief prediction concerns East Norwegian. We expect the lowest point between the two H tones to be to the left rather than to the right. Just as in simplex forms in several other dialects, the L prominence tone should follow the initial H directly. From that point, there should be interpolation to the final H%, since the L tone neither spreads nor associates from or to any point further to the right. Hence, we expect the L pitch minimum (henceforth Lmin) to be leftward.
In West Swedish, the expectation under both the alignment and association hypotheses is that we should see an intonation floor (low plateau) between the H tones. This makes no particular prediction regarding the location of the lowest point. In principle, it could quite possibly occur to the right. One possible difference between the alignment and association hypotheses is that Lmin should not occur further to the right than the last stress under the association hypothesis.
At a general level, then, we expect the pitch contour between the H tones to be flatter in West Swedish than in East Norwegian.
Method
In order to find preliminary support for a difference regarding the association/alignment of the L tone it is a good idea to look at long compounds. The longer the compound, the greater the opportunity for an unassociated contour to rise in East Norwegian. Conversely, if the last stress remains a relatively low point also in long West Swedish compounds, then that is an indication of association.
We have excerpted a number of compounds from four speakers in local radio programs in Göteborg (3 speakers) and Oslo (1 speaker). Excerpted forms were all in focus position (medial or final) such that they clearly contained the HLH contour within the compound. Our pitch analysis was carried out by means of Praat (Boersma & Weenink 2001). We marked the H points for each excerpted compound in the sound object window and then identified the lowest point between them. This can be conveniently done by combining visual inspection with the 'move cursor to maximum/minimum pitch' function in Praat. The lowest point was annotated Lmin on the point tier, and we also marked one more L point (outside of the syllable containing Lmin), so as to get a reference point.
With these marks in place we get an idea of whether Lmin is rightward or leftward. Also, by comparing the low points we get an idea of the flatness of the floor or the steepness of the rise.
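Once a pitch track has been extracted, finding the lowest point between two marked H points is a mechanical step that can also be scripted. The sketch below is a minimal Python illustration under our own assumptions, not the procedure used here (which relied on visual inspection and Praat's cursor-to-minimum function); it presupposes that the f0 track is available as (time, f0) pairs with unvoiced frames removed and that the two H times have been marked by hand. All values in the example are invented.

```python
# Minimal sketch (not the authors' procedure): locate the pitch minimum
# between two hand-marked H points in an exported f0 track.
# The track is a list of (time_in_s, f0_in_Hz) pairs, unvoiced frames removed.

def l_min_between(track, t_h1, t_h2):
    """Return (time, f0) of the lowest f0 value between the two H marks."""
    window = [(t, f0) for (t, f0) in track if t_h1 <= t <= t_h2]
    if not window:
        raise ValueError("no voiced frames between the two H points")
    return min(window, key=lambda point: point[1])

# Invented example values, loosely modelled on a compound with H points
# at 0.25 s and 1.60 s.
track = [(0.20, 210.0), (0.30, 195.0), (0.60, 150.0),
         (1.10, 120.0), (1.45, 125.0), (1.60, 230.0)]
t_min, f0_min = l_min_between(track, 0.25, 1.60)
print(f"Lmin at {t_min:.2f} s, {f0_min:.0f} Hz")
```

Whether Lmin then counts as leftward or rightward is a matter of relating the returned time to the positions of the stressed syllables, just as in the manual annotation.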
Results
Pending permission, we cannot publish any of the pitch contours from the radio material. Instead, we present illustrative recordings of a Göteborg speaker in Figure 2 and of an Oslo speaker in Figure 3.
[Figure 2: Praat pitch contour (100-400 Hz) over time, syllables ek sem pel me ning ar, tone annotation H Lmin L H.]
Figure 2. Göteborg compound exempelmeningar, with Lmin in the last stressed syllable.
[Figure 3: Praat pitch contour (0-300 Hz) over time, syllables sam ferd sels by råd, tone annotation H L Lmin H.]
Figure 3. Oslo compound samferdselsbyråd, with Lmin in the second syllable.
In Tables 1 and 2, we give the list of the investigated compounds. In each compound, the syllable containing Lmin is underscored.
Table 1. Oslo. 9 tokens, one female speaker (FAn).
word, Lmin underscored    gloss
granskingsrapportene    the scrutiny reports
fremskrittspartiet a    the progress party
fremskrittspartiet b    the progress party
ordførerspørsmålet    the president question
samferdselsbyråd a    communications councillor
samferdselsbyråd b    communications councillor
arbeidstilsynet    the work inspection
holdeplassene    the stations
hestedrosjene    the horse taxis
Table 2. Göteborg. 9 tokens, three male speakers (MAg 3, MBg 2, MCg 4).
word, Lmin underscored    gloss
fotbollsfamiljen    the football family
seniorsidan    the seniors' side
fotbollsälskare    football lovers
dagsläget    day form
åsidosatta    set aside
exempelmeningar    sentence examples
betydelseassociationen    meaning associations
meningsbyggnad    syntax
obeslutsamme    indecisive
Discussion
As seen, there is a clear tendency for the Lmin to occur to the left in the Oslo data. Lmin is also quite rightward, as a tendency, in the Göteborg data. In Oslo, the tone starts to rise immediately after the L. In long compounds containing several stresses, too, it keeps rising past the final stress. If there were an association point here, we would expect final stresses to be L. In Göteborg Swedish we find that the final stress is invariably low and, whether it is the Lmin or not, there is clearly a tonal floor between the H points. Thus, we conclude that it is worthwhile pursuing the association hypothesis with a fuller investigation.
Conclusion and prospects
One of the important tasks of phonology is to generate hypotheses for phonetics. In this pilot study, we have begun to follow up on such hypotheses regarding the tonal phonology of compounds in West Swedish and East Norwegian. In the follow-up investigation we plan to make our own recordings and see if the findings can be further substantiated. Regarding Göteborg, we hope to be able to clearly separate the predictions of the alignment hypothesis from those of the association hypothesis, by studying long compounds with the final stress at different distances from the right edge. Regarding Oslo, we hope to show that the last (though not necessarily final) stress in long compounds may be integrated into the final rise. We expect that using both normal and loud speaking mode as used by Segerup (2004) will be a good way of bringing out the prosodic profile of the structures investigated. As mentioned, East Norwegian also has tone accent 1 in compounds. In those cases the L is associated to the main stressed syllable, and we expect the rise to begin even earlier, and the following stresses to be integrated into the rising slope.
Acknowledgements
We wish to thank Sara Myrberg for comments on the text and Per Olav Heggtveit for providing us with the illustrative example of an Oslo compound for this article.
Notes
1. Note that the isogloss as such is independent of melodic differences regarding tone value.
2. The implications of the two hypotheses are less important for this pilot study, but they will be heeded in the follow-up study, where controlled materials will be recorded.
3. In Göteborg, the H% tone is often followed by a slight fall (Gårding & Stenberg 1990, Kuronen 1999). Since the H% is very clearly in the final syllable, and we focus on the L tone before it, we will disregard the very end of the contour.
References
Boersma, Paul & David Weenink. 2001. Praat: Doing phonetics by computer. <http://www.fon.hum.uva.nl/praat/>
Bruce, Gösta. 1974. Tonaccentregler för sammansatta ord i några sydsvenska stadsmål, in: Platzack, C. (ed.): Svenskans beskrivning 8, 62-75.
Delsing, Lars-Olof & Holm, Lisa. 1988. Om akut accent i skånskan, in: Sagt och skrivet. (Festskrift till David Kornhall.) Institutionen för nordiska språk, Lunds universitet.
Fretheim, Thorstein & Randi Alice Nilsen. 1989. Terminal rise and rise-fall tunes in East Norwegian intonation. Nordic Journal of Linguistics 12, 155-181.
Gårding, Eva & Michal Stenberg. 1990. West Swedish and East Norwegian intonation, in: Kalevi Wiik & Ilkka Raimo (eds.), Nordic Prosody V, 111-130. Turku: Dept. of Phonetics, University of Turku.
Gårding, Eva. 1977. The Scandinavian word accents. Lund: CWK Gleerup.
Kristoffersen, Gjert. 1992. Tonelag i sammensatte ord i østnorsk, Norsk lingvistisk tidsskrift 10, 39-65.
Kristoffersen, Gjert. 2000. The Phonology of Norwegian. (The phonology of the world's languages.) Oxford University Press, Oxford.
Kuronen, Mikko. 1999. Prosodiska särdrag i göteborgska, in: Svenskans beskrivning 23. Lund: Lund University Press, 188-96.
Riad, Tomas. 1998. Towards a Scandinavian accent typology, in: Kehrein, W. & Wiese, R. (eds.) Phonology and Morphology of the Germanic Languages, 77-109. (Linguistische Arbeiten 386) Niemeyer, Tübingen.
Riad, Tomas. 2003. Diachrony of the Scandinavian accent typology, in: Fikkert, P. & Jacobs, H. (eds.) Development in Prosodic Systems (Studies in Generative Grammar 58). Berlin/New York: Mouton de Gruyter. 91-144.
Riad, Tomas. 2005. Historien om tonaccenten, in: Falk, Cecilia & Lars-Olof Delsing (eds) Svenska språkets historia 8. Studentlitteratur, Lund. S. 1-27.
Riad, Tomas. 2006. Scandinavian accent typology. Sprachtypol. Univ. Forsch. (STUF), Berlin 59 (2006) 1, 36-55.
Riad, Tomas. 2008. "Börk börk börk. Ehula hule de chokolad muus". Språktidningen nr 2, 2008, 34-39.
Segerup, My. 2004. Gothenburg Swedish word accents – a fine distinction, in: Branderud, P. & H. Traunmüller (eds). Proceedings Fonetik 2004 (Department of Linguistics, Stockholm University) 28-31.
Withgott, Meg & Halvorsen, Per-Kristian. 1988. Phonetic and phonological considerations bearing on the representation of east Norwegian accent, in: van der Hulst, H. & Smith, N. (eds.): Autosegmental studies on pitch accent, 279-294. Foris, Dordrecht.
Vowels in rural southwest Tyrone
Una Cunningham
Department of Arts and Languages, Högskolan Dalarna, Falun
Abstract
This study aims to pin down some of the phonetic variation and oddities associated with Northern Ireland English (NIE) in general and the English of rural southwest Tyrone (ERST) in particular. Vowel quality and vowel quantity relationships are crucial here. ERST may have short or long vowels, depending on factors that are not phonologically interesting in other varieties of English. Vowel shifts from Middle English are only partly carried through, leading to sociophonetic variation.
Northern Ireland English
The Northern Irish English (NIE) accent is
quite distinctive in many ways. It is an accent
that is noticed outside of Northern Ireland, and
one that has often been generally stigmatised in
other parts of the UK. However there is a good
deal of variation within Northern Ireland. It is
well documented that the accents spoken in dif-
ferent parts of the province reflect different
combinations of the main accent forces that op-
erate. The peculiarities of the history of Ireland,
in particular the Plantation of Ulster in the sev-
enteenth century and the shift from Irish to Eng-
lish from the latter half of the nineteenth cen-
tury have left their mark in the way English is
spoken in different parts of Ulster to the present
day.
Southwest Tyrone
In a band stretching across Ulster from Belfast to Donegal the dialects spoken in the Republic of Ireland meet the Ulster Scots of the northernmost counties in what is known as Mid-Ulster English (MUE), which has been found to share features of both dialects (Harris 1985). Tyrone is one of the southernmost counties in Northern Ireland. The varieties of Mid-Ulster English found in Southwest Tyrone are particularly broad, representing a variation between older forms and newer ones. Rural speakers are generally expected to be more conservative than urban ones.
One of the most prominent features of NIE is the unusual timing conditions that hold between long and short vowels. What Harris (1985) refers to as Aitken's Law, and McCafferty (2001:133) refers to as the Scottish Vowel Length Rule, formulated to account for vowel length in Scots dialects, is said to apply here. This means that in certain phonetic environments vowels that in RP would be half long, such as [ɛ] in bed, are pronounced with a long vowel, while vowels that would in RP be pronounced with a long vowel are pronounced with a noticeably short vowel, e.g. [fʉd]. The particular conditions of quantity in ERST will be documented here.
The phonological system of ERST, along with other varieties of NIE, is not entirely identical to that of RP. As in Scots, the /u:/-/ʊ/ distinction is not upheld. This is not a very linguistically useful distinction, so very little communicative information is lost. Other distinctions are made that are not made in RP, such as between horse and hoarse (Wells 1982). In some cases there is variation between two vowel qualities, noticeably in words like pull which are found as [pʌl] (stigmatized) and [pʉl]. The first vowel of words like comfort can be either of these in some speakers.
Material
Unlike many dialectological studies, which focus on elderly rural speakers, this study examines the vowels of young speakers. Two brothers, aged 8 and 14, and their sisters, aged 10 and 18 at the time of recording, were asked to read a wordlist. The wordlist includes examples of all the phonemes of RP and has a number of key phonetic environments for the high front vowels /i:, ɪ/ in particular. This was part of a larger material, including texts and spontaneous speech. Recordings were made using a Zoom H4 digital recorder.
Vowel quantity
McCafferty (2001) accounts for the quantity conditions upheld in (London)Derry English, another variety of Mid-Ulster English spoken in Derry city, which is about 40 km from Southwest Tyrone. There are almost no phonemic vowel length distinctions, but phonetic lengthening is activated in certain environments (see Table 1 for a description of the situation in Belfast vernacular, another variety of Mid-Ulster English spoken about 80 km from Southwest Tyrone). According to Aitken's Law (Aitken 1981), vowel length is conditioned by the phonetic environment after the vowel. This process may happen alongside the more general enhanced fortis clipping that can be found in most or all varieties of English.
Table 1. Aitken's Law as applied to Belfast vernacular, another variety of Mid-Ulster English, after Harris (1985:43).
Context | /i/ | /e/ | /ɛ/
_# | see | day | -
_z | breeze | daze | Des
_n | keen | rain | pen
_d | seed | fade | dead
_s | geese | face | mess
_t | feet | fate | pet
(Long: /i/ before _# and _z; /e/ before _#, _z, _n and _d; /ɛ/ before _z, _n, _d and _s. The remaining cells are short.)
Looking then at the results obtained by the informants for the vowel /i:/ in Aitken's long (see, leaves, trees, believe) and short (green, feel, sheep) contexts, we find the following, shown speaker by speaker, with a female Estuary English (EE) speaker as a control.
Figure 1. Average vowel length for the vowel /i:/ in Aitken's long (see, leaves, trees, believe) and short (green, feel, sheep) contexts.
So what we see here is that there is a considerable difference between the vowel length in the long condition and the short condition for all four of the siblings in the study. This would seem to support the assumption that Aitken's law should apply in ERST as a variety of MUE. However, the definition of the long and short contexts overlaps partly with the distinction between fortis and lenis postvocalic consonants. All of the long contexts are those in which the enhanced length difference between vowels preceding fortis and lenis consonants would also lead to the vowel being long. The short contexts, however, have both fortis and lenis postvocalic consonants. The control speaker, a 50-year-old female Estuary English speaker, also had a considerable difference between vowel length in long and short contexts, but the difference is less. It appears that her long /i:/ is about as long as the ERST /i:/, but that her short vowels are not as short as those of the ERST speakers.
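Summaries of this kind reduce to grouping token durations by Aitken context and averaging per speaker. The sketch below illustrates the computation with invented durations rather than the measured ones; the word-to-context assignment follows the word lists cited above.

```python
# Sketch: average vowel duration (ms) per Aitken context and speaker.
# Durations are invented placeholders, not the values behind Figure 1.

LONG_CONTEXT = {"see", "leaves", "trees", "believe"}
SHORT_CONTEXT = {"green", "feel", "sheep"}

tokens = [  # (speaker, word, vowel duration in ms)
    ("b14", "see", 250), ("b14", "green", 90),
    ("g18", "leaves", 230), ("g18", "sheep", 100),
]

def mean_by_context(tokens, speaker):
    sums = {"long": [0.0, 0], "short": [0.0, 0]}
    for spk, word, dur in tokens:
        if spk != speaker:
            continue
        ctx = "long" if word in LONG_CONTEXT else "short"
        sums[ctx][0] += dur
        sums[ctx][1] += 1
    return {ctx: total / n for ctx, (total, n) in sums.items() if n}

print(mean_by_context(tokens, "b14"))   # e.g. {'long': 250.0, 'short': 90.0}
```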
So what happens then for vowels that are short in RP and other accents? Consider the case (using the notation for varieties of English developed in Wells 1982) of the DRESS vowel /ɛ/. According to Aitken's law, this vowel will be long in certain postvocalic contexts, such as bed, and short in others, such as get. Unfortunately the recorded material does not have many word list versions of this vowel. For the FACE vowel (Wells 1982), /e/, which is a monophthong in ERST and other varieties of NIE, there is data, however. The words day (long) and places and great (short) can serve as examples of the way this length condition works in ERST. Again, the EE speaker serves as a control for comparison.
Figure 2. Average vowel length for the vowel /e/ in Aitken's long (day) and short (places, great) contexts.
Here the difference between the EE speaker and the ERST speakers is less obvious from the figure, but the fact that the ERST long vowel is monophthongal makes it quite prominently long.
Vowel quality
In NIE in general, there are a number of prominent characteristics of the vowel inventory. One is that RP's /u:/ and /ʊ/ (Wells 1982's GOOSE and FOOT) merge to /ʉ/ so that boot and foot rhyme (McCafferty 2001). Another is the variation between [ʌ] and [ʊ] that appears to have sociophonetic significance. Consider the vowel plots in Figure 3 of the F1 vs F2 formant frequencies found in the word list elicitations of the 14-year-old male speaker.
Figure 3. Formant frequency plots for the vowels of the word list elicited from the 14-year-old boy.
Notice the quality merging of vowels in GOOSE words (which would have /u:/ in RP) and FOOT words (which would have /ʊ/ in RP). Notice also the variation between [ʌ]-like pronunciations of FOOT words, shown on the vowel chart as ʊ (the well-documented case of the word pull, e.g. McCafferty 2001), and the [ʊ]-like pronunciation of STRUT words, shown on the vowel chart as ʌ, such as comfort. So then, as explained by McCafferty (2001:158), words of the FOOT class are variably realized with the GOOSE vowel [ʉ] and the STRUT vowel [ʌ].
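A vowel chart of the kind shown in Figure 3 is conventionally drawn with F2 on the horizontal axis and F1 on the vertical axis, both reversed, so that the plot roughly mirrors articulatory vowel space. The sketch below shows one way to produce such a plot; the formant values are invented placeholders, not the boy's measurements.

```python
# Sketch of an F1/F2 vowel chart in the style of Figure 3.
# Formant values are invented, purely for illustration.
import matplotlib.pyplot as plt

tokens = [  # (label, F1 in Hz, F2 in Hz)
    ("u:", 350, 1600),   # GOOSE word with a central, merged quality
    ("ʊ", 360, 1650),    # FOOT word overlapping with GOOSE
    ("ʌ", 600, 1300),    # STRUT word
    ("i:", 300, 2300),
]

fig, ax = plt.subplots()
for label, f1, f2 in tokens:
    ax.scatter(f2, f1)
    ax.annotate(label, (f2, f1))

# Reverse both axes so high front vowels appear at the top left.
ax.invert_xaxis()
ax.invert_yaxis()
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
plt.show()
```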
McCafferty (2001: 157-166) deals with the variation between [ʌ] and [ʊ] at length. An [ʌ]-like pronunciation of words like pull has been found in both rural and urban speech. This feature is very common in the vernacular, but is also stigmatised by the upwardly aspiring. In fact the 14-year-old informant was mocked by his listening aunt when he read pull as [pʌl]. This may be why he adjusted his pronunciation in the next occurrence of the word to something like [pul].
Conclusion
This study shows that the speech found in rural southwest Tyrone demonstrates many of the features found in previous studies. In particular, Aitken's law appears to apply (although a follow-up study will hopefully fill in gaps in the data set to further test the relationship between Aitken's law and fortis clipping in ERST). The GOOSE-FOOT merger and the GOOSE-STRUT variation are features found both in ERST and in accounts of the speech of other communities where Mid-Ulster English is spoken.
Acknowledgements
This work would have been impossible without
the cooperation and generous interest of my in-
formants in Southwest Tyrone.
References
Aitken, A.J. (1981) The Scottish vowel length rule. In So meny people longages and tonges: Philological essays in Scots and mediaeval English presented to Angus McIntosh, M. Benskin and M. L. Samuels (eds), 131-157. Edinburgh: Middle English Dialect Project.
Harris, J. (1985). Phonological variation and change: studies in Hiberno-English. Cambridge: Cambridge Univ. Press.
McCafferty, K. (2001). Ethnicity and Language Change. English in (London)Derry, Northern Ireland. Philadelphia, PA, USA: John Benjamins Publishing Company.
Wells, J. (1982). Accents of English. Vol. II. Cambridge: Cambridge Univ. Press.
The beginnings of a database for historical sound change
Olle Engstrand¹, Pétur Helgason² and Mikael Parkvall¹
¹Department of Linguistics, Stockholm University
²Department of Linguistics and Philology, Uppsala University
Abstract
We report a preliminary version of a database
from which examples of historical sound
change can be retrieved and analyzed. To date,
the database contains about 1,000 examples of
regular sound changes from a variety of lan-
guage families. As exemplified in the text,
searches can be made based on IPA symbols,
articulatory features, segmental or prosodic
context, or type of change. Ultimately, the da-
tabase is meant to provide an adequately large
sample of areally and genetically balanced in-
formation on historical sound changes that
tend to take place in the world's languages. It
is also meant as a research tool in the quest for
diachronic explanations of genetic and areal
biases in synchronic typology.
Background and purpose
From its early beginnings in the 18th century,
diachronic phonology has had considerable
success in documenting and reconstructing his-
torical sound changes in various languages and
language families, as well as in formulating
general theories for how and why speech
sounds change over time (see, e.g., Lehmann
1962, Anttila 1989, Lass 1997). However, the
resulting data are scattered, heterogeneous and
often hard to interpret. If the information could
be made available in a searchable and compa-
rable form, we would have access to a valuable
basis for typologically oriented studies of his-
torical sound change. With this objective in
mind, we have made a preliminary survey of
sound changes as observed in a number of lan-
guages and language families.
Method
So far, we have entered about one thousand ex-
amples of regular consonant changes into a Mi-
crosoft Excel data sheet. The rows (database
records) correspond to individual sound
changes, and the columns (fields) represent the
parameters of each change. For each change, 'input sound' (the changing sound) and 'output sound' (the sound resulting from the change) are represented in terms of feature values and IPA symbols. Additional parameters
include right and left context, prosodic infor-
mation, type of change (such as elision, epen-
thesis, devoicing or assimilation) and, where
information is available, relative chronology.
Examples, questions and comments are noted
in separate fields.
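To make the record-and-field layout concrete, the following sketch shows how one such record could be represented outside the spreadsheet. It is only an illustration of the structure just described, not the project's actual file format, and the field names are ours.

```python
# Illustrative sketch of one database record (row) and its fields (columns).
# Field names are ours; the project itself uses a Microsoft Excel sheet.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SoundChange:
    input_ipa: str          # the "From" column
    output_ipa: str         # the "To" column; "0" marks elision
    input_features: dict    # manner, place, voicing, secondary articulation
    output_features: dict
    context_before: str = "V"
    context_after: str = "V"
    change_type: str = ""   # e.g. elision, epenthesis, devoicing, assimilation
    chronology: Optional[str] = None
    example: str = ""
    comment: str = ""

# The first change of Table 1 below: Vulgar Latin [b] > [β] between vowels.
row1 = SoundChange(
    input_ipa="b", output_ipa="β",
    input_features={"man": "stop", "place": "bilab", "voice": "vd", "sec": "0"},
    output_features={"man": "fric", "place": "bilab", "voice": "vd", "sec": "0"},
    change_type="fricativization",
)
```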
An illustration is given in Table 1. The table illustrates a small subset of database records (rows) and their evaluations in terms of field parameters (columns). All sound changes in the table were sampled from the phonological development of Vulgar Latin into Modern Italian (Grandgent 1908, 1927, Bertinetto & Loporcaro 2005). It should be noted, however, that our preliminary corpus contains more non-European than European languages.
In the left columns, the labels 'C old man', 'C old place', 'C old voice' and 'C old sec' refer to properties of input, i.e., 'old', sounds. 'C' stands for consonant, and 'man', 'place', 'voice' and 'sec' represent manner of articulation, place of articulation, voicing and secondary articulation, respectively. Further to the right, the new (output) sounds are specified using these same dimensions. The 'From' and 'To' columns show the respective input and output sounds in IPA notation. Segmental context is specified in the rightmost columns (context before and context after the changing segment, respectively; 'V' means vowel and the '#' symbol represents a word boundary). For example, the first row exemplifies a change from [b] (a voiced bilabial stop, in this case without any secondary articulation; thus, the latter variable is evaluated as 0, zero) to [β] (a voiced bilabial fricative, again with no secondary articulation). The two rightmost columns indicate that this change has occurred in intervocalic position. The second row illustrates an elision: a [β] is dropped in intervocalic position. As this leaves no remaining consonant, the articulatory variables are not applicable (hence 'n.a.').
Rows 5 and 7 suggest that a certain sound change may occur independently at different times or in different dialect areas, rows 2 and 4 show that a given input sound may be differently affected depending on context, and rows 7 and 8 demonstrate that identical outputs can originate from different sources. The remaining rows can be read analogously.
Searches are performed based on transcriptions or feature profiles using the program's standard filter functions. For example, searching /r/ as the input sound will return all historical developments from /r/ that are represented in the database. For another example, a search for [nasal] & [consonant] as the right-hand context will yield all changes that have taken place before nasal consonants. Searches may be constrained by combining several criteria; assume, for example, that the input sound has the features [velar] & [stop], that its right-hand context is specified in terms of [vowel] & [front], and that the output sound is [palatal] or [coronal] & [obstruent]. These conditions will be met by various types of an intuitively ubiquitous historical sound change, that is, velar palatalization in front vowel context (Guion 1998).
Table 1. A subset of database records and some field values for consonants.
Entry | C old (man, place, voice, sec) | From | To | C new (man, place, voice, sec) | Ctx before | Ctx after
1 | stop bilab vd 0 | b | β | fric bilab vd 0 | V | V
2 | fric bilab vd 0 | β | 0 | n.a. n.a. n.a. n.a. | V | V
3 | glide labvel vd 0 | w | gʷ | stop vel vd labzd | # | V
4 | fric bilab vd 0 | β | v | fric labdent vd 0 | # | V
5 | fric cor vl 0 | s | ʃ | fric postalv vl 0 | # | V
6 | fric cor vl 0 | s | ts | affric cor vl 0 | # | V
7 | fric cor vl 0 | s | ʃ | fric postalv vl 0 | # | V
8 | affric postalv vl 0 | tʃ | ʃ | fric postalv vl 0 | # | V
9 | glide pal vd 0 | j | dʒ | affric postalv vd 0 | # | V
10 | affric postalv vd 0 | dʒ | ʒ | fric postalv vd 0 | # | V
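The combined-criteria search described above can be mimicked outside the spreadsheet in a few lines. The sketch below is a hypothetical re-implementation of the velar palatalization query over records shaped roughly like the rows of Table 1; it is not the project's own tooling, and the feature vocabulary is simplified.

```python
# Hypothetical sketch of the combined-criteria search described in the text:
# input = velar stop, right-hand context = front vowel, output = palatal or
# coronal obstruent (velar palatalization). Records are simplified dicts.

records = [
    {"in": {"manner": "stop", "place": "vel"},
     "ctx_after": {"vowel", "front"},
     "out": {"manner": "affric", "place": "postalv", "class": "obstruent"}},
    {"in": {"manner": "stop", "place": "bilab"},
     "ctx_after": {"vowel"},
     "out": {"manner": "fric", "place": "bilab", "class": "obstruent"}},
]

def is_velar_palatalization(rec):
    inp, out = rec["in"], rec["out"]
    return (inp["place"] == "vel" and inp["manner"] == "stop"
            and {"vowel", "front"} <= rec["ctx_after"]
            and out.get("class") == "obstruent"
            and out["place"] in {"pal", "postalv", "cor"})

hits = [rec for rec in records if is_velar_palatalization(rec)]
print(len(hits), "matching change(s)")   # 1 with these toy records
```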

Looking for patterns: an example
The results of an exemplifying test run are given in Table 2. The table, which is based on 993 recorded cases, illustrates the relative incidence of the indicated kinds of historical sound change. Types of change are shown in the left column, and frequencies of occurrence pertaining to the respective types are shown as percentages in the right column. It can be seen, for example, that 18 percent of all cases are elisions, that 10 percent represent a development from stop to fricative (i.e., fricativization), and that debuccalization, delabialization and depalatalization stand for 4, 6 and 3 percent of the changes, respectively. (The term 'debuccalization' is used to refer to a process whereby a consonant's supraglottal place of articulation is substituted by [h] or [ʔ].) If all these types are regarded as lenitions, 39 percent of all sound changes in the database are lenitions (this figure is slightly lower than the sum of the percentages because a change may occasionally comprise more than one of these components). It should be pointed out, however, that the true proportion of lenitions is probably much higher, because assimilations were not counted in this test run (the reason being that all necessary contextual information is not yet in place). Even though these figures are not necessarily representative of the languages of the world, observations of this kind may enhance our understanding of the driving forces behind processes of historical sound change.
Table 2. Types of lenitions given as percentages of 993 recorded changes. The bottom row summarizes the total percentage of lenitions in terms of these types.
Type of change | Percent (N=993)
Elision | 18
Fricativization | 10
Debuccalization | 4
Delabialization | 6
Depalatalization | 3
Total lenitions | 39
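The 'total lenitions' row reflects a counting detail worth making explicit: a single change may involve more than one lenition component, so the total is the share of changes with at least one such component rather than the sum of the rows. The sketch below illustrates that tally over a handful of invented records.

```python
# Sketch of the Table 2 tally: per-type percentages plus a total that counts
# each change at most once, even if it has several lenition components.
# The records are invented and do not reproduce the database.

LENITION_TYPES = {"elision", "fricativization", "debuccalization",
                  "delabialization", "depalatalization"}

changes = [
    {"types": {"elision"}},
    {"types": {"fricativization"}},
    {"types": {"delabialization", "depalatalization"}},  # one change, two components
    {"types": {"assimilation"}},  # not a lenition type here
]

n = len(changes)
per_type = {t: 100 * sum(t in c["types"] for c in changes) / n
            for t in LENITION_TYPES}
total_lenitions = 100 * sum(bool(c["types"] & LENITION_TYPES) for c in changes) / n

print({t: round(p) for t, p in per_type.items()})
print("total lenitions:", round(total_lenitions), "%")  # 75 < 25+25+25+25
```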
Future developments and applications
One long-term goal of the present project is to
create a database that will be fairly representa-
tive of the historical sound changes that tend to
take place in the languages of the world. This
will require a large number of records based on
a genetically and areally balanced selection of
languages and language families. Clearly, the
interpretation of many of these sources will re-
quire expert advice and assistance.
As a phonetic contribution, a database for
diachronic phonology will help to identify the
preconditions of historical sound change that
are hidden in universal constraints on speech
production, perception and development (e.g.,
Hombert et al. 1979, Greenlee & Ohala 1980,
Locke 1983, Svantesson 1983, Lindblom 1990,
Ohala 1993, Willerman 1994, Lindblom et al.
1995, Engstrand & Williams 1996, Helgason
2002). This perspective also has a potential for
guiding experimental quests for parallels be-
tween historical sound changes and on-line
speech communication phenomena such as
coarticulation, reduction and restoration (Ohala
1989, Engstrand & Lacerda 1996, Engstrand et
al. 2007). In addition, given a full-size, repre-
sentative database, it will be possible to track
down developmental tendencies in genetically
or areally defined language groups and to com-
pare the diachronic data with the corresponding
typological patterns as observed in synchronic
databases such as UPSID (Maddieson 1984).
Thus, diachronic data may help to account for
patterns of areal typology that are not readily
accessible to synchronic explanation, e.g.,
'crazy rules' or 'telescope rules' such as the alternation p>s/_i in Bantu (Bach & Harms
1972, Hyman 1975, Ohala 1983, Blevins 2004)
as well as areal correlations among complex
segments such as prenasalized, implosive and
doubly articulated stops (Ladefoged 1964,
Sherman 1975, Maddieson 1984, Lindblom &
Maddieson 1988, Ladefoged & Maddieson
1996, Engstrand 1997, Janson & Engstrand
2001). Many such marked phonologies may
become quite transparent in a historical per-
spective.
Acknowledgements
This project is being carried out in collabora-
tion with Juliette Blevins, Max Planck Institute
for Evolutionary Anthropology, Leipzig. Many
thanks to Pier Marco Bertinetto, Scuola Nor-
male Superiore, Pisa, for valuable help and ad-
vice. This work has been supported in part by
Fondazione Famiglia Rausing through a grant
to the first author.
References
Anttila R. (1989) Historical and comparative linguistics. Philadelphia: John Benjamins.
Bach E. & Harms R.T. (1972) How do languages get crazy rules? In R P Stockwell & R K S Macauly (eds.), Linguistic change and generative theory, 1-21. Bloomington: Indiana University Press.
Bertinetto P.M. & Loporcaro M. (2005) The sound pattern of Standard Italian, as compared with the varieties spoken in Florence, Milan and Rome. Journal of the International Phonetics Association 35, 131-151.
Blevins J. (2004) Evolutionary phonology: The emergence of sound patterns. Cambridge: Cambridge University Press.
Engstrand O. (1997) Areal biases in stop paradigms. PHONUM 4, 187-190.
Engstrand O. & Lacerda F. (1996) Lenition of stop consonants in conversational speech: evidence from Swedish (with a sideview on stops in the world's languages). Arbeitsberichte, Institut für Phonetik und digitale Sprachverarbeitung, Universität Kiel (AIPUK), 31, 31-41.
Engstrand O. & Williams K. (1996) VOT in stop inventories and in young children's vocalizations: preliminary analyses. Proceedings of FONETIK 1996, Nässlingen, May 1996. Speech, Music and Hearing, Royal Institute of Technology, Quarterly Progress and Status Report 2/1996, 97-99.
Engstrand, O., Frid J. and Lindblom B. (2007) A perceptual bridge between coronal and dorsal /r/. In P. Beddor, M. Ohala and M.-J. Solé (eds.), Experimental Approaches to Phonology. Oxford University Press, 175-191.
Grandgent C.H. (1908) An introduction to Vulgar Latin. Boston: Heath & Co.
Grandgent C.H. (1927) From Latin to Italian. An historical outline of the phonology and morphology of the Italian language. Cambridge: Harvard University Press.
Greenlee, M. & Ohala, J.J. (1980) Phonetically motivated parallels between child phonology and historical sound change, Language Sciences 2, 283-308.
Guion S. (1998) The role of perception in the sound change of velar palatalization. Phonetica 55, 18-52.
Helgason P. (2002) Preaspiration in the Nordic languages: Synchronic and diachronic aspects. PhD diss., Stockholm University.
Hombert J.-M., Ohala J.J. & Ewan W.G. (1979) Phonetic explanations for the development of tones. Language 55, 37-58.
Hyman L.M. (1975) Phonology: theory and analysis. New York: Holt, Rinehart and Winston.
Janson T. & Engstrand O. (2001) Some unusual sounds in Changana. Proceedings of FONETIK 2001, Örenäs, May 30-June 1, 2001. Working Papers, Department of Linguistics, Lund University 49, 74-77.
Ladefoged P. (1964) A phonetic study of West African languages. Cambridge: Cambridge University Press.
Ladefoged P. & Maddieson I. (1996) The sounds of the world's languages. Oxford: Blackwell.
Lass R. (1997) Historical linguistics and language change. Cambridge: Cambridge University Press.
Lehmann W.P. (1962) Historical linguistics: An introduction. New York: Holt, Rinehart and Winston.
Lindblom B. (1990) Explaining phonetic variation: A sketch of the H&H theory. In W.J. Hardcastle & A. Marchal (eds.), Speech production and speech modeling, Kluwer: Dordrecht, 403-439.
Lindblom B. & Maddieson I. (1988) Phonetic universals in consonant systems. In L.M. Hyman & C.N. Li (eds.), Language, speech, and mind. New York: Routledge, 62-78.
Lindblom B., Guion S., Hura S., Moon S.-J. & Willerman R. (1995) Is sound change adaptive? Rivista di Linguistica 7, 5-37.
Locke J.L. (1983) Phonological acquisition and change. New York: Academic Press.
Maddieson I. (1984) Patterns of sounds. Cambridge: Cambridge University Press.
Ohala J.J. (1983) The origin of sound patterns in vocal tract constraints. In P.F. MacNeilage (ed.), The Production of Speech, 189-216. New York: Springer-Verlag.
Ohala, J.J. (1989) Sound change is drawn from a pool of synchronic variation. In L.E. Breivik & E.H. Jahr (eds.), Language Change: Contributions to the study of its causes. [Series: Trends in Linguistics, Studies and Monographs No. 43]. Berlin: Mouton de Gruyter, 173-198.
Ohala J.J. (1993) The phonetics of sound change. In C. Jones (ed.), Historical Linguistics: Problems and Perspectives. London: Longman, 237-278.
Sherman D. (1975) Stop and fricative systems: a discussion of paradigmatic gaps and the question of language sampling. Stanford University Phonology Archiving Project, Working Papers on Language Universals 17, 1-31.
Svantesson J.-O. (1983) Kammu phonology and morphology. Lund: Gleerup.
Willerman R. (1994) The phonetics of pronouns: articulatory bases of markedness. PhD diss., University of Texas, Austin.
Author index

Ambrazaitis, Gilbert 81
Ananthakrishnan, G 9
Barry, William J. 25
Beskow, Jonas 33, 61, 85
Blomberg, Mats 37
Bruce, Gösta 61, 85
Cunningham, Una 97
Diehl, Randy 5
Edlund, Jens 17, 29, 33
Eir Cortes, Elisabet 1
Engstrand, Olle 101
Engwall, Olov 57
Enflo, Laura 61, 85
Elenius, Daniel 37
Frid, Johan 41
Granström, Björn 33, 61, 85
Gustafson, Joakim 17, 33, 69
Heldner, Mattias 29
Helgason, Pétur 65
Hincks, Rebecca 21
House, David 89
Jonsson, Oskar 33
Karlsson, Anastasia 89
Koreman, Jacques 25
Laskowski, Kornel 29
Lindblom, Björn 1, 5
Lindseth, Marte Kristine 25
McAllister, Anita 77
Neiberg, Daniel 9
Park, Sang-Hoon 5
Parkvall, Mikael 65
Pind, Jörgen L. 49
Riad, Tomas 93
Salvi, Giampiero 5
Schötz, Susanne 61, 85
Segerup, My 93
Seppänen, Tapio 53
Skantze, Gabriel 33
Stenberg, Michal 13
Strangert, Eva 69
Tayanin, Damrong 89
Toivanen, Juhani 53, 73
Tronnier, Mechtild 77
van Dommelen, Wim A. 45
Wik, Preben 57
Väyrynen, Eero 53
Ylitalo, Riika 65