FONETIK 2008
The XXIst Swedish Phonetics Conference
June 11–13, 2008
Department of Linguistics
University of Gothenburg
Proceedings FONETIK 2008
The XXIst Swedish Phonetics Conference,
held at the University of Gothenburg, June 11–13, 2008
Edited by Anders Eriksson and Jonas Lindh
Department of Linguistics
University of Gothenburg
Box 200, SE 405 30 Gothenburg
ISBN 978-91-977196-0-5
© The Authors and the Department of Linguistics
Printed by Reprocentralen, Humanisten, University of Gothenburg.
Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg
Preface
This volume contains the contributions to FONETIK 2008, the Twenty-first Swedish
Phonetics Conference, organized by the Phonetics group at the University of
Gothenburg on June 11–13, 2008. The papers appear in alphabetical order of the
surname of the first author.
Only a limited number of copies of this publication have been printed for
distribution among the authors and those attending the conference. For access to
electronic versions of the contributions, please look under:
http://www.ling.gu.se/konferenser/fonetik2008/papers/Proc_fonetik_2008.pdf
We would like to thank all contributors to the Proceedings. We are also indebted to
Fonetikstiftelsen for financial support.
Göteborg, June 2008
Anders Eriksson, Åsa Abelin, Jonas Lindh
Previous Swedish Phonetics Conferences (from 1986)
I 1986 Uppsala University
II 1988 Lund University
III 1989 KTH Stockholm
IV 1990 Umeå University (Lövånger)
V 1991 Stockholm University
VI 1992 Chalmers and Göteborg University
VII 1993 Uppsala University
VIII 1994 Lund University (Höör)
1995 (XIIIth ICPhS in Stockholm)
IX 1996 KTH Stockholm (Nässlingen)
X 1997 Umeå University
XI 1998 Stockholm University
XII 1999 Göteborg University
XIII 2000 Skövde University
XIV 2001 Lund University
XV 2002 KTH Stockholm
XVI 2003 Umeå University (Lövånger)
XVII 2004 Stockholm University
XVIII 2005 Göteborg University
XIX 2006 Lund University
XX 2007 KTH Stockholm
Contents
Speech production
From articulatory to acoustic parameters non-stop 1
Elisabet Eir Cortes and Björn Lindblom
(Re)use of place features in voiced stop systems: Role of phonetic constraints 5
Björn Lindblom, Randy Diehl, Sang-Hoon Park and Giampiero Salvi
On the Non-uniqueness of Acoustic-to-Articulatory Mapping 9
Daniel Neiberg and G. Ananthakrishnan
Pronunciation in Swedish encyclopedias: phonetic transcriptions and sound 13
recordings
Michal Stenberg
Prosody I
EXPROS: Tools for exploratory experimentation with prosody 17
Joakim Gustafson and Jens Edlund
Presenting in English or Swedish: Differences in speaking rate 21
Rebecca Hincks
Preaspiration and perceived vowel duration in Norwegian 25
Jacques Koreman, William J. Barry and Marte Kristine Lindseth
The fundamental frequency variation spectrum 29
Kornel Laskowski, Mattias Heldner and Jens Edlund
Speech technology
Speech technology in the European project MonAMI 33
Jonas Beskow, Jens Edlund, Björn Granström, Joakim Gustafson, Oskar Jonsson and
Gabriel Skantze
Knowledge-rich model transformations for speaker normalization in speech 37
recognition
Mats Blomberg and Daniel Elenius
Development of a southern Swedish clustergen voice for speech synthesis 41
Johan Frid
Speech perception
The perception of English consonants by Norwegian listeners: A preliminary report 45
Wim A. van Dommelen
Reaction times in the perception of quantity in Icelandic 49
Jörgen L. Pind
Emotion discrimination with increasing time windows in spoken Finnish 53
Eero Väyrynen, Juhani Toivanen and Tapio Seppänen
Looking at tongues – can it help in speech perception? 57
Preben Wik and Olov Engwall
Variation and change I
Human Recognition of Swedish Dialects 61
Jonas Beskow, Gösta Bruce, Laura Enflo, Björn Granström and Susanne Schötz
F0 in contrastively accented words in three Finnish dialect areas 65
Riikka Ylitalo
Speech acquisition, speech development and second language
learning
Improving speaker skill in a resynthesis experiment 69
Eva Strangert and Joakim Gustafson
Second-language speaker interpretations of intonational semantics in English 73
Juhani Toivanen
Measures of continuous voicing related to voice quality in five-year-old children 77
Mechtild Tronnier and Anita McAllister
Prosody II
On final rises and fall-rises in German and Swedish 81
Gilbert Ambrazaitis
SWING: A tool for modelling intonational varieties of Swedish 85
Jonas Beskow, Gösta Bruce, Laura Enflo, Björn Granström and Susanne Schötz
Recognizing phrase and utterance as prosodic units in non-tonal dialects of Kammu 89
Anastasia Karlsson, David House and Damrong Tayanin
Phonological association of tone. Phonetic implications in West Swedish and East 93
Norwegian
Tomas Riad and My Segerup
Variation and change II
Vowels in rural southwest Tyrone 97
Una Cunningham
The beginnings of a database for historical sound change 101
Olle Engstrand, Pétur Helgason and Mikael Parkvall
Author index 105
[Figure 1 panels: A) X-ray profile; B) Vocal tract midline; C) Mid-sagittal distances; D) Distance-to-area rules (A = α·d^β); E) Area functions derived; F) Area functions run through software; G) Software returns formant frequencies.]
From articulatory to acoustic parameters non-stop
Elisabet Eir Cortes and Björn Lindblom
Department of Linguistics, Stockholm University
Abstract
This paper reports an attempt to map the time
variations of selected articulatory parameters
(from X-ray profiles) directly on the F1, F2 and
F3 formant tracks using multiple regression
analysis (MRA). The results indicate that MRA
can indeed be useful for predicting formant
frequencies. Since the results reported here are
limited to preliminary observations of F1 only,
further studies including F2 and F3 are needed
to evaluate the method more definitively.
Introduction
The traditional method of calculating the for-
mant pattern associated with a set of articulato-
ry measurements goes by way of the cross-
sectional area function (Fant 1960). Heinz and
Stevens (1964) proposed a procedure which de-
rives vocal tract cross-sectional areas from
cross-distances. At each point in the vocal tract,
the following formula (Equation 1) relates the
mid-sagittal distance d to the area A of the
cross-section at that particular point:

A = α·d^β (Eq. 1)

where α and β are constants dependent on
speaker and position along the vocal tract. The
7 steps of this method can be summarized as in
Fig.1.
Figure 1. The traditional method of deriving for-
mants from articulatory information.
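As an illustration of step D, the distance-to-area rule of Equation (1) can be applied pointwise along the tract. The sketch below uses placeholder values for α, β and the cross-distances, not fitted constants from the literature:

```python
import numpy as np

def areas_from_distances(d, alpha, beta):
    """Apply Eq. (1), A = alpha * d**beta, at each cross-section.

    d            : mid-sagittal cross-distances (cm), one per section
    alpha, beta  : constants, in reality speaker- and position-dependent
    """
    return alpha * np.asarray(d, dtype=float) ** beta

# Placeholder cross-distances from glottis to lips (cm), and a single
# alpha/beta pair applied uniformly for simplicity.
d = [0.4, 1.1, 2.0, 1.3, 0.6]
A = areas_from_distances(d, alpha=1.5, beta=1.4)
print(np.round(A, 2))  # cross-sectional areas (cm^2)
```

In Ericsdotter's evaluation the constants vary with position and speaker; a faithful implementation would carry one (α, β) pair per cross-section.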
The performance of this method was recent-
ly evaluated by Ericsdotter (2005). With the aid
of MR images from two subjects articulating
Swedish vowels, she was able to test Equation
(1) at a number of locations both in the pharynx
and in the oral cavity. By and large the method
was found to be descriptively adequate. She
found that the values of α and β varied with
position in the vocal tract and between subjects.
By taking vowel identity into account the area
predictions were somewhat improved. Yet,
acoustically, this improvement seemed to be of
"only slight acoustic importance" (p. 155). It is
interesting to note that in many cases Ericsdotter
in fact found that a "roughening" of the method,
reducing the number of cross-sections and thus
the number of equations, did not severely
worsen the acoustic outcome. We will have
more to say about that observation below.
Work on articulatory modeling (e.g. the
APEX model; Stark et al. 1996) indicated that
variations in articulatory parameters such as the
jaw, the front-back dimension of the tongue and
the elevation of the tongue blade have an ap-
proximate but fairly simple relationship to for-
mant changes (Lindblom & Sundberg 1971).
Examples of such rules of thumb are:
F1 is controlled by the jaw
F2 is controlled by the front-back movements of
the tongue body
F3 is controlled by the tongue blade (elevation
opens a cavity under the tongue blade, which
yields a low F3 in e.g. retroflex articulation).
The following question arises: Would it be
possible to express such rules of thumb more
quantitatively? Suppose we have access to arti-
culatory data, could we simplify the procedure
of Fig.1 by eliminating altogether the interme-
diate stages B-F, thus moving straight from A
to G in 2 steps (see Fig.2 below).
How well would formants be predicted by
such a drastic shortcut method? Will the accu-
racy of the predictions be satisfactory for vari-
ous applications, say educational (teaching
acoustic phonetics) and technical (articulatory
synthesis), to name some?
Figure 2. The method under investigation: directly from A) the X-ray profile to G) the formant frequencies.
On the down side an immediate objection
springs to mind: the well-documented non-
linearity in the relationship between articulation
and acoustics (Fant 1960; Stevens 1998). The
non-monotone nature of this relationship sug-
gests that a direct connection will be hard to
find. However, there are also certain positive
considerations to make.
First, our research group is fortunate to have
access to a unique collection of X-ray record-
ings with synchronously recorded sound (Bran-
derud et al. 1998). By extracting parallel articu-
latory and acoustic measurements, the opportu-
nity opens for investigating the question raised
earlier, and for giving an empirical answer to
how well formants can be predicted non-stop
from articulatory data.
Second, it might be useful to explore certain
statistical techniques, for instance multiple re-
gression analysis (MRA), a method that can be
used to numerically relate a dependent variable
(a formant frequency, for instance) to several
independent variables (articulatory parameter
values, for instance). Importantly, this method
is not limited to linear regressions, since inde-
pendent variables can be defined as various
mathematical functions of the articulatory pa-
rameters.
The aim of the current study is thus as fol-
lows: To select articulatory data from an X-ray
film of a male speaker focusing on the frames
associated with vowels and vowel transitions,
make moment-by-moment comparisons with
the corresponding formant measurements, and
investigate the numerical relationships (linear
or not) that obtain between them. For the pur-
pose of the present report, the investigatory fo-
cus will be on the first formant.
Procedure
Our data come from a 20-second X-ray film of
a Swedish male speaker (for speech materials,
see Table 1). The images portraying a midsagit-
tal articulatory profile were sampled at 50
frames/sec (see Branderud et al. 1998 for de-
tails on the method).
Table 1. The speech material. Parentheses indicate
sounds not included in the analysis.
Vowel b-context d-con. g-con.
/i/ /ibi(pi:l)/ /di/ /i/
// /b(p:k)/ /d/ //
/a/ /da(s)/ /a(st)/
// /b(p:r)/ /d(l)/ /(l)/
/o/ /obo(po:l)/ /do(lk)/ /o(lv)/
/u/ /ubu(pu:l)/ /du(s)/ /u(s)/
The final part of the b-words, as well as the fi-
nal consonant in the other words, was not in-
cluded in the analysis; nonetheless, its presence
announces itself beforehand in the preceding
vowel. We will return to this in the Results.
Tracings of all acoustically relevant struc-
tures were made using the OSIRIS software
package (University of Geneva). Contours de-
fined in Osiris were converted into tables with
x- and y-coordinates using PAPEX (Branderud
2002), calibrated in mm and corrected for head
movements (palate contour fit). For the tongue,
the contours were further processed by redefin-
ing them in a jaw-based coordinate system and
by resampling them at 25 equidistant flesh-
points, which were fed into a Principal Com-
ponent Analysis (PCA), providing the numeri-
cal specification of the tongue shapes (see
Lindblom 2003 for details on the method).
Articulatory parameters
In the classical work on vocal tract modeling
(Fant 1960) the location of the main constric-
tion, the size of that constriction and the length
and the opening area of the lip section have
been shown to be important determinants of the
output formant pattern. Bearing such findings
in mind we included the parameters listed in
Table 2 in our analyses.
The measurements were made from tracings
of midsagittal profiles of the subject's vocal
tract. The lip section was described in terms of
three parameters: the vertical separation of the
lips, the location of the mouth corner relative to
the upper incisors and the degree of protrusion
of the lips relative to the mouth corner. Jaw
opening is defined as the position of the lower
incisors relative to their location in clench.
The tongue contours were specified in terms
of Principal Components derived from a sample
based on 411 tongue contours.

Table 2. Articulatory parameters

Articulatory parameter        Description/calculation
Vertical separation of lips   Midsagittal distance
Jaw opening                   IncInf rel. to clench
Lip length                    (Ulip+Llip)/2 - IPC
Location of mouth corner      IPC - IncSup
Tongue contour                2 Principal Components
Larynx height                 Vertical distance IncSup-Lx
Pharynx back wall I           slope
Pharynx back wall II          intercept

The input to this analysis was a matrix in which
columns corresponded to the 25 fleshpoints and
the rows contained information identifying in-
dividual tongue contours. Since the specifica-
tion of each fleshpoint requires two numbers
(x & y), there were twice as many rows as con-
tours. Accordingly, the data fed into the PCA
was an 822-by-25 matrix. This format had the
convenience
of automatically sorting the PCA output into
two sets: one consisting of the horizontal
weights (for the x coordinates) and the other
containing the vertical weights (for the y
coordinates). In the present multi-regressions
we limit the description of the tongue to four
numbers derived from the horizontal and the
vertical sets of the first two Principal Compo-
nents.
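The PCA step can be sketched as follows. The contour data here are random stand-ins for the actual X-ray tracings, and the analysis uses a plain SVD rather than any particular statistics package:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 411 tongue contours resampled at 25 fleshpoints.
# Rows 0..410 hold the x coordinates, rows 411..821 the y coordinates,
# giving the 822-by-25 matrix described in the text.
n_contours, n_points = 411, 25
X = rng.normal(size=(2 * n_contours, n_points))

# PCA via SVD of the mean-centred matrix; each component is a
# 25-point fleshpoint profile.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
weights = Xc @ Vt.T  # one weight per row per component

# The row layout automatically sorts the output into two sets:
horiz = weights[:n_contours, :2]  # x-weights on PC1, PC2
vert = weights[n_contours:, :2]   # y-weights on PC1, PC2

# Four numbers then specify each tongue shape:
tongue_params = np.hstack([horiz, vert])
print(tongue_params.shape)  # (411, 4)
```

The slicing at `n_contours` reproduces the convenience noted in the text: stacking x-rows above y-rows makes the horizontal and vertical weight sets fall out of the PCA output without any extra bookkeeping.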
Larynx height was measured as the vertical
distance between the horizontal plane through
the tip of the upper incisor and the horizontal
plane through the posteriormost point of the
larynx contour.
Since movements of the back wall of the
pharynx would affect the posterior vocal tract
volumes, and there was evidence of some pha-
ryngeal movement, we approximated the back
wall with a straight line and used its slope and
intercept.
Analysis
Manual pulse-by-pulse measurements of the
first three formants were performed in
Soundswell 4.00.30 (Hitech Development AB),
using the Spectrogram Tool with FFT-points of
128/512, bandwidth of 250 Hz, Hanning win-
dow of 8 ms. The synchronization between the
acoustic data and the x-ray film was done nu-
merically by finding the best line-up between
the time points of acoustic segment boundaries
and the synch pulses corresponding to the indi-
vidual images. The error of this procedure is
estimated at a few msec.
Lastly, a multiple regression analysis
(MRA) was performed (using Excel Analysis
Tool-pack) on the acoustic and the articulatory
data, F1 data serving as the dependent variable
and the several articulatory parameters as the
independent variables.
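The same analysis can be reproduced with any least-squares routine. The sketch below uses numpy on synthetic stand-in data (the coefficients and parameter set are invented for illustration, not measured values), and includes a squared jaw term to show that derived, non-linear regressors fit into the same framework:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: one row per film frame. The columns play the role of
# articulatory parameters from Table 2; values here are synthetic.
n = 200
jaw = rng.uniform(0, 12, n)    # jaw opening (mm)
lips = rng.uniform(0, 15, n)   # vertical lip separation (mm)
pc1 = rng.normal(size=n)       # tongue PC1 weight
f1 = 250 + 35 * jaw + 8 * lips + 40 * pc1 + rng.normal(0, 20, n)

# Design matrix: intercept + parameters. MRA is not limited to linear
# regressors -- derived terms such as jaw**2 enter as extra columns.
X = np.column_stack([np.ones(n), jaw, lips, pc1, jaw ** 2])
coef, *_ = np.linalg.lstsq(X, f1, rcond=None)

pred = X @ coef
r2 = 1 - np.sum((f1 - pred) ** 2) / np.sum((f1 - f1.mean()) ** 2)
print(f"r^2 = {r2:.2f}")
```

Observed-vs-predicted plots like Fig. 6 then follow directly from `f1` and `pred`.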
Results
In Fig.3 frequency measurements for the first
formant are shown plotted against jaw move-
ment data for all words, revealing a poor corre-
lation between the acoustics and the articula-
tion, thus seemingly confirming earlier findings
about the non-monotone nature of the articula-
tory-acoustic relationship.
Figure 3. F1 vs. JAW, all words.
Examination of single words is helpful for
understanding some of the sources of the poor
correlation.
Figure 4. Time variations for F1 and JAW in
/b(pk)/. F1 is seen to be independent of jaw po-
sition during the stop closure, which lowers F1 but
leaves the jaw trace at a more or less fixed value.
In Fig.4 the time variations of F1 and JAW
are shown for /b(pk)/. In the occlusion
(bordered by red vertical lines), we find a clear
discrepancy between acoustics and articulation:
F1 takes a dive while the JAW shows hardly
any movement at all.
[Figure graphics for Figs. 3–6: F1 (Hz) and JAW (mm) plotted against time (sec) for the individual words; F1 (Hz) plotted against jaw vertical movements (mm); observed vs. MRA-predicted F1, "from jaw" and "from all art.par's".]
Figure 5. Time variations for F1 and JAW (left col-
umn: /da(s)/; right column: /(lv)/). Here is a
second example of F1's independence of the jaw. In
both words the jaw is raised without any major ef-
fect on F1.
Fig.5 gives a closer view of the anticipation of
the final consonant. Here F1 stays put, while
the JAW exhibits a steep rise, presumably in
preparation for the following consonants, /s/
and /l/ respectively. We know that consonantal
jaw positions tend to be high compared to those
of vowels. In particular, the articulation of /s/
demands a high and steady jaw (Keating et al.
1994).
Figure 6. Observed F1 vs. MRA-predicted F1.
These examples help us understand why jaw
position alone is a poor predictor of F1 fre-
quency. They also suggest the use of MRA.
Fig.6 compares predictions based on the jaw
alone (gray symbols) with MRA results (solid
circles). The left diagram shows how MRA im-
proves the correlation score for the b-words.
The improvement (from r² = 0.17 to r² = 0.73)
is mainly due to the fact that the drastic F1
lowering gets linked to the lips reaching closure
(cf. Fig. 4). The right diagram illustrates the
corresponding results for the g-words. Here the
improvement (from r² = 0.31 to r² = 0.83) oc-
curs because, despite the decrease in jaw open-
ing, F1 can remain high in the context of
tongue blade elevation (cf. Fig. 5). How well
are formants predicted by the present shortcut
method? It is too early to give a final answer,
since our results are limited to preliminary ob-
servations of F1. Suffice it to say that with a
carefully motivated selection of articulatory in-
formation, further improvements seem possible.
References
Branderud P, Lundberg J-J, Lander J, Djam-
shidpey H, Wneland I, Krull D &
Lindblom B (1998): X-ray analysis of
speech: Methodological aspects, Proceed-
ings of FONETIK 98 (Dept of Linguistics,
SU) 168-171.
Branderud P (2002): Papex software, Dept of
Linguistics, Stockholm University.
Cortes E. (forthc.) Mapping articulatory para-
meters on the formant pattern. Dept of Lin-
guistics, Stockholm University.
Ericsdotter C (2005): Articulatory-Acoustic Re-
lationships in Swedish Vowel Sounds, Ph D
diss, Department of Linguistics, Stockholm
University.
Fant G. (1960) Acoustic Theory of Speech Pro-
duction Mouton: Hague.
Heinz J. M & Stevens K N (1964): On the De-
rivation of Area Functions and Acoustic
Spectra from Cinradiographic Films of
Speech, 67th ASA meeting: 1037-1038.
Keating, P.A., Lindblom, B., Lubker, J., and
Kreiman, J. (1994) Variability in jaw
height for segments in English and Swedish
VCVs, J Phonetics 22:407-422.
Lindblom B & Sundberg J (1971): Acoustical
consequences of lip, tongue, jaw and larynx
movement, J Acoust Soc Am 50:1166-
1179.
Lindblom B (2003): A numerical model of
coarticulation based on a Principal Compo-
nent analysis of tongue shapes, Proc XV
th
ICPhS, Barcelona.
Stevens K N (1998): Acoustic Phonetics, MIT
Press.
Stark J, Lindblom B & Sundberg J (1996):
APEX an articulatory synthesis model for
experimental and computational studies of
speech production, TMH-QPSR, 37(2):45-
48.
(Re)use of place features in voiced stop systems:
Role of phonetic constraints
Björn Lindblom¹, Randy Diehl³, Sang-Hoon Park³ and Giampiero Salvi²
¹ Dept of Linguistics, Stockholm University, SE 10691 Stockholm
² KTH, Dept of Speech Music and Hearing, SE 10044 Stockholm
³ Dept of Psychology, University of Texas at Austin, Austin, Texas 78712, USA
Abstract
Computational experiments focused on place of
articulation in voiced stops were designed to
generate optimal inventories of CV syllables
from a larger set of possible CV:s in the pres-
ence of independently and numerically defined
articulatory, perceptual and developmental
constraints. Across vowel contexts the most sa-
lient places were retroflex, palatal and uvular.
This was evident from acoustic measurements
and perceptual data. Simulation results using
the criterion of perceptual contrast alone failed
to produce systems with the typologically widely
attested set [b] [d] [g], whereas using articu-
latory cost as the sole criterion produced in-
ventories in which bilabial, dental/alveolar and
velar onsets formed the core. Neither perceptual
contrast, nor articulatory cost (nor the two
combined), produced a consistent re-use of
place features (phonemic coding). Only sys-
tems constrained by "target learning" exhibited
a strong recombination of place features.
Introduction
The simulations were aimed at modeling the use
and re-use of place features in voiced stop in-
ventories. We addressed two issues: First, what
explains the predominance of labial [b], den-
tal/alveolar [d] and velar [g] in the world's lan-
guages (Maddieson 1984)? Second, why do all
languages systematically re-use phonetic at-
tributes (Clements 2003)? In other words, why
are phonetic forms phonemically coded?
This work is an extension of Liljencrants &
Lindblom's model of vowel systems (1972).
The program developed by Giampiero Salvi
systematically selects subsets of CV sequences
from a larger universal inventory of CV syl-
lables. It evaluates all possible systems in terms
of an optimization criterion. This criterion
quantifies how distinctive the syllables in the
subset are (perceptual contrast), how difficult
they are to produce (articulatory cost) and
how difficult they are to learn (learning cost).
An optimal subset is identified as the one that
minimizes the sum of the subset's cost/contrast
scores. The articulatory cost metric was devel-
oped from bio-mechanical measures of the ar-
ticulatory representations of the syllables. De-
gree of perceptual contrast was defined on the
basis of experimental data on listener confusions
and distance judgments (Park 2007). Two in-
terpretations of the end state of phonetic learn-
ing were studied: The first assumes that the ob-
ject of phonetic learning is to acquire dynamic
units (gestures, as in articulatory phonology);
the other hypothesizes that the learning of pho-
netic movements involves the mastery of time-
less via points ("targets") and the mechanism
of motor equivalence, the ability to reach motor
goals from arbitrary initial conditions (Lashley
1951).
Place and perceptual contrast
To provide empirical and independent motiva-
tion for assumptions used in the simulations
several subprojects were undertaken.
Park (2007) investigated the problem of de-
fining perceptual contrast of voiced stops. A
phonetically trained speaker produced 35 CV
syllables in which the place of the stop was var-
ied in seven steps ranging from bilabial, dental,
alveolar, retroflex, palatal, velar to uvular. The
vowel was [i] [e] [a] [o] or [u].
Perceptual judgments were collected from
four groups of phonetically naïve subjects
whose native languages are English, Hindi,
Korean and Mexican Spanish. There were five
subjects per language group. Their task was to
identify the syllables which were presented at
four signal/noise conditions (no noise, +5, 0, -5
dB). The effect of native language was weak or
absent which motivated the use of pooled data.
Confusion matrices tested negatively for re-
sponse bias effects. The matrix with the pooled
data was symmetrized using a method due to
Shepard (1972).
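A minimal sketch of symmetrizing a confusion matrix is given below. Note that this uses simple cell-wise averaging of the row-normalized matrix for illustration only; Shepard's (1972) procedure differs in its details:

```python
import numpy as np

# Toy 3x3 confusion matrix: rows = stimuli, columns = responses.
C = np.array([[50, 8, 2],
              [10, 45, 5],
              [4, 6, 50]], dtype=float)

# Convert counts to conditional response probabilities per stimulus...
P = C / C.sum(axis=1, keepdims=True)

# ...then average each cell with its transpose-mate. (Illustrative
# averaging only; Shepard's 1972 method is more elaborate.)
S = (P + P.T) / 2
assert np.allclose(S, S.T)  # symmetric by construction
print(np.round(S, 3))
```

A symmetric matrix of this kind is what distance-derivation methods (e.g. multidimensional scaling) typically expect as input.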
On the basis of acoustic measurements and
listener responses, acoustic and perceptual dis-
tance matrices were derived. In a comparison
of several acoustic measures of distance with
the perceptual distances derived from the con-
fusions, it was found that the acoustic variable
with the strongest predictive power was the
formant-based distance. Including also burst
and formant-rate distances improved correla-
tions further (Park 2007).
Place and articulatory cost
From the viewpoint of biomechanics the ar-
ticulatory cost of moving between two arbitrary
configurations a and b, in a fixed time interval,
should be strongly related to the distance be-
tween them. This reasoning led us to quantify
the cost of a CV syllable as follows:
A(C,V) = [dist(rest,C)]² + [dist(C,V)]²    (1)
This formula says that A(C,V), the cost of a
given CV, is the sum of the consonant onset's
displacement from rest plus the vowel end-
point's displacement from the onset. The use of
squared values is intended to reflect the physio-
logical fact that the relationship between muscle
length and muscle force is non-linear (cf.
force-length diagrams). This measure was ap-
plied to articulatory representations of the 35
syllables.
Lacking direct articulatory measurements of
the recorded CV syllables, we used a subset of
about 500 tracings of X-ray profile images of a
single Swedish speaker available from other
projects (Branderud et al 1998). This corpus was
searched for representative vowels and strongly
constricted configurations at points of articula-
tion ranging from dental to uvular. Data on the
rest position were also included. It was defined
in terms of the articulatory configuration
adopted during quiet breathing. The final selec-
tion consisted of images of [i e a o u] sampled in
stressed syllables near the vowel midpoints and
configurations representing dental, alveolar,
retroflex, palatal, velar and uvular closures. To
facilitate comparisons between the contours,
they were resampled at 25 equidistant flesh-
points.
In applying Equation (1) to these represen-
tations articulatory distances, dist(rest,C) and
dist(C,V), were computed as the
root-mean-square of the inter-fleshpoint dis-
tances.
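Combining Equation (1) with the RMS distance measure gives a cost function that can be sketched as follows (the contours here are random stand-ins for the X-ray fleshpoint data):

```python
import numpy as np

def rms_dist(a, b):
    """Root-mean-square of fleshpoint-to-fleshpoint distances between
    two contours, each an (n_points, 2) array of x,y coordinates."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

def articulatory_cost(rest, cons, vowel):
    """Eq. (1): squared displacement from rest to the consonant onset
    plus squared displacement from the onset to the vowel endpoint."""
    return rms_dist(rest, cons) ** 2 + rms_dist(cons, vowel) ** 2

# Stand-in contours with 25 fleshpoints each (real data would come
# from the X-ray tracings described in the text).
rng = np.random.default_rng(2)
rest = rng.normal(size=(25, 2))
cons = rest + 0.5   # onset shifted 0.5 in x and y from rest
vowel = rest + 1.0  # endpoint shifted a further 0.5 from the onset
print(round(articulatory_cost(rest, cons, vowel), 3))  # → 1.0
```

Because both legs of the toy example have the same RMS displacement, the two squared terms contribute equally to the total cost.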
The main findings were: (i) The proposed
cost measure ranks places with respect to in-
creasing cost as follows: bilabial, dental, velar,
alveolar, palatal, uvular & retroflex; (ii) it cap-
tures the notion of "assimilation", successfully
pairing front vowels with anterior consonant
onsets and back vowels with posterior conso-
nant onsets. The first finding is related in part
to defining the cost measure as deviation from
rest, in part to identifying rest with the articula-
tory settings of quiet breathing: a raised jaw;
closed lips; a fronted tongue creating a more
open posterior vocal tract facilitating breathing
through the nose. The second result is linked to
the use of Eq. (1).
Although the cost measure is a first ap-
proximation, the predicted preferences show
good agreement with typological data on the use
of place in stops. The world's languages use 17
target regions from lips to glottis (Ladefoged &
Maddieson 1996). Irrespective of inventory size,
nearly all (over 99%) of UPSID's 317 languages
(Maddieson 1984) recruit three places of ar-
ticulation in stops: bilabial, dental/alveolar and
velar.
The findings are also in good agreement with
observations of infant speech production which
show a strong tendency for alveolar closures to
co-occur with front vowels, velar with back
vowels and labial with central vowels (Davis &
MacNeilage 1995).
Gestural or target-based control?
Is adult speech production gesture- or tar-
get-based? We argue that taking a stand on this
issue also implies taking a stand on what the end
state of phonetic learning is. Do children learn
gestures or targets? The simulations were set up
to reflect those two possibilities.
In recent times the traditional target theory
of speech has not gone unchallenged. Support
for dynamic gestures as basic units comes from
experimental data indicating that visual and
auditory systems are more sensitive to changing
stimulus patterns than to purely static ones.
There is also evidence from speech perception
experiments (Strange 1989).
The problem that such observations pose for
a target theory of speech is that, if perception
likes change, why assume that the control of
speech production is based on static targets?
Should not what a talker controls in production
be what the listener wants in perception?
We argue that the fact that dynamic proper-
ties of speech are important in perception should
not in any way rule out the possibility that
speakers might use a sparse representation of
speech movements. While rejecting a target
theory of speech perception appears justified,
dismissing a target theory of speech production
appears premature.
This conclusion becomes clear when the
formal definition of gesture is examined. The
standard reference is to the work by Saltzman &
Munhall (1989). Their task-dynamic model is
often described as using an input of gestural
primitives. However, the fact that gestures are
formally defined in terms of point attractors
reveals that the notion of target is actually part
of their technical definition.
Targets and phonetic learning
We conclude from these considerations that
there is significant support for assuming that
adult speech processes are target-based and are
set up to generate motor equivalent behavior. In
other words, within its work space, the system
generates the movement between A (an arbitrary
current location) and B (an arbitrary movement
goal) and does so for situations requiring new
compensatory motion paths.
If these processes are part of the adult
speaker's phonetic competence, they must
somehow be acquired by the learner. We sug-
gest (i) that in development targets are the re-
sidual ("least action") end products of matching
the response characteristics of the speech effec-
tors to the dynamics of the ambient speech; and
(ii) that the movement paths (transitions) be-
tween targets are handled by the general mech-
anism of motor equivalence.
These assumptions lead to the following
corollary: Once a target has been learned in
one context, it can immediately be re-used in
other contexts, since motor equivalence han-
dles the trajectory for the new context.
The above set of hypotheses will be referred
to as "target learning". A form of "gestural
learning" will also be included in the simula-
tions. It will be interpreted to mean acquiring
gestures holistically.
Computational experiments
We here consider the set of possible CV:s to
consist of 35 syllables although languages in
principle have an uncountable number to choose
from. The goal of the simulations is to investi-
gate subsets of the 35 CV items by ranking them
according to an optimality criterion.
Building on Liljencrants & Lindblom's
model, we developed the criterion with the fol-
lowing components:
perceptual contrast D(S) is a global measure of
perceptual dissimilarity based on the pairwise
dissimilarity D(i,j) of any syllables sᵢ and sⱼ be-
longing to the system S;
articulatory cost A(i) is the cost of each syllable
sᵢ belonging to S;
learnability is a measure of the effort required to
learn system S. It is based on the number of
consonant onsets w and vowel endpoints z that
the syllables belonging to system S share.
The criterion to optimize is:

score(S) = 1/D(S)^2 * SUM_{i=1}^{N} A(i)/r(i)    (2)

The term 1/D(S)^2 corresponds to the definition
of contrast used by Liljencrants & Lindblom.
When contrast does not contribute, this term is
equal to 1. Learnability r(i) can assume the
forms: r(i) = 1 for gestural learning, or r(i) = wz
for target learning.
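As an illustration, the criterion in (2) could be computed along the following lines. The dissimilarity and cost tables fed to the function, and the use of a summed pairwise dissimilarity for D(S), are hypothetical placeholders: in the actual model, D(i,j) and A(i) come from the perceptual and articulatory components described above.

```python
from itertools import combinations

def score(system, dissim, cost, target_learning=True):
    """score(S) = 1/D(S)^2 * sum_i A(i)/r(i), per criterion (2).

    system: list of (consonant, vowel) syllables
    dissim: dict mapping frozenset syllable pairs to dissimilarity D(i, j)
    cost:   dict mapping syllables to articulatory cost A(i)
    """
    # Global perceptual contrast D(S): here the summed pairwise dissimilarity
    # (an assumption; the text only says D(S) is based on pairwise D(i, j)).
    D = sum(dissim[frozenset(p)] for p in combinations(system, 2))
    total = 0.0
    for syl in system:
        c, v = syl
        if target_learning:
            # r(i) = w * z: re-use counts of the consonant onset and
            # the vowel endpoint within the system
            w = sum(1 for s in system if s[0] == c)
            z = sum(1 for s in system if s[1] == v)
            r = w * z
        else:
            r = 1  # gestural learning: no credit for target re-use
        total += cost[syl] / r
    return (1.0 / D ** 2) * total
```

Since r(i) >= 1, a system scored under target learning can never cost more than the same system under gestural learning, which is the re-use advantage the corollary describes.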
The rationale for adding the r(i) term is as
follows: The child's attempts to imitate and
spontaneously use a given phonetic form come
up against the articulatory complexity of that
form. As imitation attempts are repeated, sen-
sory references are established. When a given
sensory experience is recorded, it automatically
gets linked to a motor reference (assuming that
the learner has a neural mirror system, that is, a
perceptual/motor link). With more practice this
motor reference is strengthened. Accordingly,
during the course of learning a sort of copying
takes place. However, it is copying only in a
non-trivial sense, since some patterns are easier
than others (read: they differ in terms of ar-
ticulatory cost A(i)). So the speed of acquisition
is affected by that cost. The
gestural and target-based approaches influence
that speed in different ways. In the gestural
mode forms are acquired at a rate inversely re-
lated to articulatory complexity. More com-
plexity means more practice. Target learning
modifies that rule. Because targets once learned
in one context can be re-used in new contexts,
practice will modify the score of all syllables
containing that target. Target learning therefore
implies more rapid learning than gestural
learning. The term r(i) controls that speeding up
by measuring how many times a given system
re-uses a given target. By definition target in-
formation is stored independently of context.
When a given target is practiced all potential
combinations using it will benefit. It appears
reasonable to assume that this would also be true
of flesh-and-blood phonetic learning. The key to
the re-use phenomenon is motor equivalence
and the context-free nature of targets.
Results
Across vowel contexts the most salient places
were retroflex, palatal and uvular. This was
evident from acoustic measurements and per-
ceptual data. Perceptual contrast alone reflected
that fact in failing to favor systems with the ty-
pologically widely attested set [b] [d] [g].
On the other hand, using articulatory cost as
the sole criterion produced inventories in which
bilabial, dental/alveolar and velar onsets formed
the core.
Neither perceptual contrast nor articulatory
cost (nor the two combined) produced a con-
sistent re-use of place features (phonemic cod-
ing). Only systems constrained by target
learning exhibited a strong recombination of
place features.
Implications
A comprehensive discussion of the present
findings is found in Lindblom et al. (in press).
Our research supports explaining the typologi-
cal preference for labial, dental/alveolar and
velar in terms of a theoretically motivated
measure of ease of articulation. It further sug-
gests that phonemic coding may also have ar-
ticulatory origins (the interplay between discrete
motor target representations and motor equiva-
lence) to which languages have adapted during
the course of history. The results do not cast
doubt on perceptual contrast playing a role in
shaping sound systems. Rather, they suggest
phonetic factors operating in interaction. Per-
haps the most novel aspect of the work is the fact
that phonetic explanations were proposed not
only for substantive aspects (place preferences)
but also for formal facts such as the recombina-
tion of place features (phonemic coding). Here
the remarks of Martinet (1968:483) come to
mind: "In so far as such combinations are easy
to realize and to identify aurally, they should be
a definite asset for a system: for the same total
of phonemes, they require less articulations to
keep distinct; these articulations, being less
numerous, will be the more distinct; each of
them being more frequent in speech, speakers
will have more occasions to perceive and pro-
duce them, and they will get anchored sooner in
the speech of children."
References
Branderud P, Lundberg H-J, Lander J, Djamshidpey H, Wäneland I, Krull D & Lindblom B (1998): "X-ray analyses of speech: Methodological aspects", in FONETIK 1998, KTH, Stockholm.
Clements G N (2003): "Feature economy in sound systems", Phonology 20:287-333.
Davis B L & MacNeilage P F (1995): "The articulatory basis of babbling", J Speech Hear Res 38:1199-1211.
Ladefoged P & Maddieson I (1996): The sounds of the world's languages, Oxford: Blackwell.
Lashley K (1951): "The problem of serial order in behavior", pp 112-136 in Jeffress L A (ed): Cerebral mechanisms in behavior, New York: Wiley.
Liljencrants J & Lindblom B (1972): "Numerical simulation of vowel quality systems: The role of perceptual contrast", Language 48:839-862.
Lindblom B, Diehl R, Park S-H & Salvi G (in press): "Sound systems are shaped by their users: The recombination of phonetic substance", to appear in Nick Clements & Rachid Ridouane (eds): Where do features come from? The nature and sources of phonological primitives.
Maddieson I (1984): Patterns of sound, Cambridge: CUP.
Martinet A (1968): "Phonetics and linguistic evolution", in Malmberg B (ed): Manual of phonetics, 464-487, Amsterdam: North-Holland.
Park S-H (2007): Quantifying perceptual contrast: The dimension of place of articulation, PhD dissertation, University of Texas at Austin.
Saltzman E L & Munhall K G (1989): "A dynamical approach to gestural patterning in speech production", Ecological Psychology 1(4):333-382.
Shepard R N (1972): "Psychological representation of speech sounds", in P. B. Denes & E. E. David Jr. (eds): Human communication: A unified view, 67-113, New York: McGraw-Hill.
Strange W (1989): "Evolving theories of vowel perception", J Acoust Soc Am 85(5):2081-2087.
On the Non-uniqueness of Acoustic-to-Articulatory
Mapping
Daniel Neiberg and G. Ananthakrishnan
Centre for Speech Technology, CSC, KTH (Royal Institute of Technology), Stockholm, Sweden
Abstract
This paper statistically studies the hypothesis
that the acoustic-to-articulatory mapping is
non-unique. The distributions of the
acoustic and articulatory spaces are obtained
by minimizing the BIC while fitting the data
into a GMM using the EM algorithm. The kur-
tosis is used to measure the non-Gaussianity of
the distributions and the Bhattacharya distance
is used to find the difference between distribu-
tions of the acoustic vectors producing non-
unique articulator configurations. It is found
that stop consonants and alveolar fricatives are
generally not only non-linear but also non-
unique, while dental fricatives are found to be
highly non-linear but fairly unique.
Introduction
The acoustic-to-articulatory (A-to-A) mapping,
also known as articulatory inversion has been
of special interest to speech researchers for
quite some time. It deals with estimating or re-
covering vocal tract shapes from the acoustics
of an utterance. It remains one of the funda-
mental problems in understanding speech pro-
duction. The inversion has several applications,
namely low bit-rate encoding, training visual
agents like talking heads, improving articula-
tory speech synthesis, and robust speech recog-
nition. Research in this topic has shown sub-
stantial progress in terms of using several ma-
chine learning techniques to minimize the error
between the true vocal tract shape and the
shape estimated using knowledge of the acous-
tics. Ouni et al. (2005) and Roweis (1997) have
performed inversion using code books and dy-
namic programming while Richmond (2006)
has done extensive research on performing the
mapping using mixture density Neural Network
(NN) regression. Hiroya (2004) and
Katsamanis et al. (2007) have used Hidden
Markov Models (HMM) with one or more
states per phoneme, in order to perform the in-
version. Toda et al. (2008) have used a Gaus-
sian mixture model (GMM) along with Maxi-
mum Likelihood Estimation (MLE) smoothing
for dynamic features. These methods have been
extremely successful at minimizing the average
root mean square error (RMSE) over all articu-
latory channels and maximizing Pearson's
Correlation Coefficient (PCC) between the ac-
tual vocal tract configuration and the predicted
ones.
The errors in the mapping are often attrib-
uted to the non-uniqueness of the inversion,
known as "fibers" in the articulatory space. By
non-uniqueness, it is meant that multiple vocal
tract configurations can produce almost the
same acoustic features. Early research pre-
sented some interesting evidence corroborating
non-uniqueness. Bite-block experiments
showed that the speakers were capable of pro-
ducing sounds perceptually close to the in-
tended sounds even though the jaw was fixed in
an unnatural position. As shown by Gay and
Lindblom (1981) the lossless tube models of
the vocal tract also indicate the possibility of
this non-uniqueness. Qin et al. (2007) per-
formed, possibly, the first empirical investiga-
tion into the non-uniqueness problem. They
quantized the acoustic space using the percep-
tual Itakura distance on LPC features. The ar-
ticulatory space was modeled using a non-
parametric Gaussian density kernel with a fixed
variance. For the phonemes where the articula-
tory distribution was found to be multi-modal,
the authors concluded that non-uniqueness
existed. They found non-uniqueness for certain
phonemes like //, /l/ and /w/, while there
seemed to be a unique mapping for certain
other phonemes.
By definition, a mapping is said to be non-
unique if more than one articulatory position
can produce exactly the same acoustic features.
In real continuous speech, however, the prob-
ability of finding two data points with exactly
the same acoustic parameters is vanishingly
small. In order to simplify the problem, the
acoustic space is quantized. If two data points
within this quantization range fall sufficiently
apart in the articulatory space, then the map-
ping is said to be non-unique. However, a result
obtained in this manner can be quite mislead-
ing, since this kind of an effect could be caused
due to insufficient resolution of quantization
and data sparseness. In an attempt to address
these issues, a new model-based paradigm is
proposed, which tries to quantify non-unique-
ness. In the following sections, a
study of the nonlinearity and non-uniqueness of
the A-to-A mapping is presented.
Data
The experiments conducted use the Electro-
magnetic Articulography (EMA) data from the
MOCHA database (Online:
http://www.cstr.ed.ac.uk/research/projects/artic
/mocha.html, accessed on 23 Jan. 2007) for a
male and a female speaker. The acoustic features
are MFCCs (D = 16) computed from the acoustic
input, and the articulatory features are the posi-
tions of the EMA coils. The x- and y-positions of
7 coils are available, which means that there are
a total of d = 14 articulator channels.
Method
The proposed method is illustrated in Figure 1.
It is based on unsupervised clustering of the
data points in acoustic space (X) for each pho-
neme, which partitions the acoustic data into
distinct Gaussian clusters. Then the data points in
articulatory space (Y) corresponding to each
acoustic subspace (modeled by a Gaussian), are
clustered again using the same technique. If the
clustering in the articulatory space generates
multiple modes, then it is a sign of non-linear
mapping. If the data points corresponding to
different modes in the articulatory subspaces
are all sampled uniformly from the same Gaus-
sian in the acoustic subspace, then it is a sign of
non-unique mapping. The details of this algo-
rithm are given below.
The clustering procedure uses a model
based approach which fits the given data into a
GMM, in such a way that every Gaussian
represents a cluster in the acoustic space. This
is achieved using the Expectation Maximiza-
tion (EM) algorithm to obtain parameters that
are Maximum Likelihood (ML) estimates
(McLachlan and Peel, 2000). The Schwarz or
Bayesian Information Criterion (BIC) is mini-
mized in order to find the optimum number of
clusters in the acoustic space. Every phoneme p
in the corpus is modeled by a Gaussian mixture
model, λ_p, containing K_p clusters. K_p is
chosen by minimizing the BIC.
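A minimal sketch of this BIC-based model selection step follows. The helper names and the log-likelihood values passed in are illustrative; in the paper, the log-likelihoods would come from EM fits of full-covariance GMMs to the acoustic vectors.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Schwarz/Bayesian Information Criterion; lower values are better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def gmm_n_params(k, d):
    """Free parameters of a k-component, d-dimensional full-covariance GMM:
    k - 1 mixture weights, k*d means, and k*d*(d+1)/2 covariance entries."""
    return (k - 1) + k * d + k * d * (d + 1) // 2

def choose_k(loglik_by_k, d, n_samples):
    """Pick the number of clusters K minimizing BIC, given the maximized
    EM log-likelihood for each candidate K."""
    return min(loglik_by_k,
               key=lambda k: bic(loglik_by_k[k], gmm_n_params(k, d), n_samples))
```

With D = 16 acoustic dimensions the parameter count grows quickly with K, so the log(n) penalty keeps K_p small unless the extra components genuinely improve the fit.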
Non-linearity
For the data points belonging to the k-th acoustic
Gaussian, X_k^p, the corresponding articulatory
subspace is modeled by an optimal number of
Gaussians, R_k^p, using the same method. The ar-
ticulatory vectors belonging to the r-th such
Gaussian are given by Y_k^p(r).
If there exists only one Gaussian in the ar-
ticulatory space for the k-th Gaussian in the
acoustic space (i.e. R_k^p = 1), then the distribu-
tion of the articulatory vectors can be predicted
by performing a linear transform on the Gaus-
sian distribution of the acoustic vectors. How-
ever, if R_k^p > 1, then this sort of a linear
transform cannot be performed. The less
normal the articulatory space is, the more non-
linear the mapping. The non-linearity of Y_k^p is
calculated by using Mardia's multi-variate
kurtosis (Mardia, 1970) for goodness of fit to a
normal distribution. This measure NL (Non-
Linearity) is the proposed measure of non-
linearity. It takes the value of 0 for a true Gaus-
sian distribution and higher positive values for
non-Gaussian distributions. It is important to
note here, that observing multi-modality in the
distribution of the articulatory features corre-
sponding to a single mode Gaussian of acoustic
features does not necessarily imply non-
uniqueness, but it necessarily implies non-
linearity. The authors want to stress this point
here, because it is easy to confuse the multi-
modality with non-uniqueness. A more strin-
gent measure is necessary to imply non-
uniqueness.
Figure 1: Three hypothetical examples of sub-
space mapping (acoustic space X, articulatory
space Y). 1) The mapping is linear; 2) the map-
ping is non-linear and unique; 3) the mapping is
non-linear and non-unique.
Non-uniqueness
Consider the Gaussian acoustic space X_k^p.
X_k^p(r) is the subset which corresponds to one
mode of the articulatory space, Y_k^p(r). There are
two possibilities. The first possibility is that
this subset does not have a Gaussian distribu-
tion. In such a case, it may be possible to find a
non-linear mapping between each of these sets
of data and the corresponding mode in the articu-
latory space. But if this part has the same Gaus-
sian distribution as the whole single Gaussian,
i.e., if the distributions of X_k^p(r) and X_k^p have
exactly the same parameters, then it connotes
that data points with exactly the same dis-
tribution in the acoustic space can actually pro-
duce articulatory features with different distri-
butions. This is the necessary and sufficient
condition to imply non-uniqueness. In order to
find out the similarity between the distributions
of X_k^p(r) and X_k^p, the Bhattacharya distance
is used. However, there is no accurate method
of calculating this distance for unknown distri-
butions. Non-parametric distribution estimates
would suffer from a data sparseness problem.
So, we use the Bhattacharya distance assuming
a Gaussian distribution, but weigh it by the
Gaussianity of the data. Non-Gaussianity is de-
termined by the kurtosis of the data points. We
bias it so that it takes the value of 1 for a per-
fect Gaussian distribution and is higher for non-
Gaussian distributions. Thus non-uniqueness,
NU_k^p(r), can be defined as the inverse of the
Bhattacharya distance (D_bh) weighed by the
measure of its Gaussian nature:

NU_k^p(r) = 1 / [K_m(X_k^p(r)) * (1 + D_bh(X_k^p(r), X_k^p))]

K_m(.) denotes the multi-variate kurtosis of the
data points. Thus, NU is lower for clusters with
unique mapping.
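The quantities in this definition can be sketched as follows. The closed-form Gaussian Bhattacharyya distance is standard; the exact normalization that biases the kurtosis term to 1 for a perfect Gaussian is not spelled out in the text, so dividing Mardia's b_{2,d} by its Gaussian expectation d(d+2) is an assumption here.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussian densities."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def mardia_kurtosis(X):
    """Mardia's multivariate kurtosis b_{2,d}; equals d(d+2) for a Gaussian."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    centred = X - mu
    # Squared Mahalanobis distance of each row, then the mean of its square
    m = np.einsum('ij,jk,ik->i', centred, np.linalg.inv(S), centred)
    return np.mean(m ** 2)

def non_uniqueness(X_r, mu_k, cov_k):
    """NU_k^p(r) = 1 / (K_m(X_k^p(r)) * (1 + D_bh(X_k^p(r), X_k^p)))."""
    d = X_r.shape[1]
    # Assumed bias: normalize so a perfect Gaussian sample gives K_m = 1
    k_m = mardia_kurtosis(X_r) / (d * (d + 2))
    d_bh = bhattacharyya_gaussian(X_r.mean(axis=0),
                                  np.cov(X_r, rowvar=False), mu_k, cov_k)
    return 1.0 / (k_m * (1.0 + d_bh))
```

For identical distributions D_bh is 0, so NU is maximal when a sub-cluster's acoustic data is indistinguishable from the whole acoustic Gaussian, exactly the non-uniqueness condition described above.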
Results
Figure 2 shows an example of non-uniqueness for
the articulatory and acoustic subspaces for pho-
neme /t/. Data points belonging to different
clusters in the articulatory space seem to belong
to the same distribution in the acoustic space.
The most discriminating features in the acoustic
space are shown to the right, while the actual
articulatory positions are shown to the left. We
can see that, the same data points have different
distributions in the articulatory space and al-
most the same distribution in the acoustic
space. This is a sign of non-uniqueness. Figure
3 shows a plot of the articulatory and acoustic
sub-spaces for phoneme /l/. Data points from
the acoustic space on the whole seem to take a
Gaussian distribution. But they do not take a
Gaussian distribution in the articulatory space.
For every cluster in the articulatory space, the
acoustic distribution is non-Gaussian in the
acoustic space.
non-linear, but still rather unique. Figure 4
shows a comparative study of the non-linearity
(NL) and non-uniqueness (NU) for a few pho-
nemes in the database, for the male and female
speaker. The mean NL_p and NU_p for phoneme p
are calculated by weighting them with the
weights of the respective Gaussians. We
can see that most consonants exhibit non-
uniqueness while most vowels have a small de-
gree of non-linearity. The stop consonants /t/,
/d/, /k/ and /p/ and fricatives like /s/ and /z/ have
a higher degree of non-uniqueness. The fricatives
// and // are highly non-linear but rather
unique, while liquids such as /l/, // and // are
found to be rather non-linear, but unique by the
method used in this paper. While the non-
linearity and non-uniqueness of stop conso-
nants is expected due to the silence region, the
reason why alveolar fricatives show high non-
uniqueness could be because the EMA data
may not have adequate detail to measure the
exact location of the tongue tip. There are con-
siderable variations in the levels of non-
uniqueness and non-linearity shown for the
male and female speakers, but the trends are
Figure 2: Plot showing the articulatory and
acoustic subspaces for phoneme /t/.
more or less similar. Figures 3 and 4 show cor-
responding data points in the articulatory and
the acoustic subspaces.
Conclusions and future work
This work proposes a method to distinguish be-
tween non-linearity and non-uniqueness. It sug-
gests measures to quantify them, and analyzes the
non-linearity and non-uniqueness of different pho-
nemes in the database for two speakers. Phonemes
such as /t/ and /p/ are found to be non-unique, while
other phonemes like // and // are found to be non-
linear, but rather unique, from our studies. Future
work can be directed at estimating the best possible
non-linear estimators for clusters with high non-
linearity, and at further constraints or allowances to
tackle the non-uniqueness of the mapping. Work
must be done on defining a non-uniqueness crite-
rion for any general distribution. Knowing which
articulators contribute more to the non-uniqueness
could be another direction for research. Studies
with more accurate methods than EMA must be
carried out to validate the results obtained in this
paper.
Acknowledgements
The authors acknowledge the financial support of
the Future and Emerging Technologies (FET) pro-
gramme within the Sixth Framework Programme
for Research of the European Commission, under
FET-Open contract no. 021324.
References
Ouni, S. and Laprie, Y. (2005) Modeling the articulatory space using a hypercube codebook for acoustic-to-articulatory inversion. J. Acoust. Soc. Am., 118(1):444-460.
Roweis, S. (1997) Towards articulatory speech recognition: Learning smooth maps to recover articulator information. Eurospeech 3:1227-1230.
Richmond, K. (2006) A Trajectory Mixture Density Network for the Acoustic-Articulatory Inversion Mapping. In Proc. ICSLP, 577-580, Pittsburgh.
Hiroya, S. (2004) Estimation of Articulatory Movements from Speech Acoustics using an HMM-based Speech Production Model. IEEE Trans. Speech and Audio Processing, 12(2):175-185.
Katsamanis, A., Papandreou, G. and Maragos, P. (2007) Audio-visual to Articulatory Inversion using HMMs. In Proc. Multimedia Signal Processing, 457-460.
Toda, T., Black, A. W. and Tokuda, K. (2008) Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication 50:215-227.
Gay, T., Lindblom, B. and Lubker, J. (1981) Production of bite-block vowels: Acoustic equivalence by selective compensation. J. Acoust. Soc. Am. 69:802-810.
Qin, C. and Carreira-Perpinan, M. A. (2007) An Empirical Investigation of the Nonuniqueness in the Acoustic-to-Articulatory Mapping. In Proc. Interspeech, 74-77, Antwerp.
McLachlan, G. and Peel, D. (2000) Finite Mixture Models. John Wiley and Sons, New York.
Mardia, K. V. (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika, 57(3):519-530.
Figure 3: Plot showing the articulatory and
acoustic subspaces for phoneme /l/.
Figure 4: Graph showing the mean non-linearity
and non-uniqueness for selected phonemes in the
MOCHA database for the male and female
speaker.
Pronunciation in Swedish encyclopedias: phonetic
transcriptions and sound recordings
Michaël Stenberg
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
Abstract
This paper presents work in progress, aimed at
a doctoral dissertation on displaying pronunci-
ation to the users of general encyclopedias,
particularly those in Swedish language. Vari-
ous types of phonetic notations are studied and
compared; also some pronunciation dictionar-
ies are taken into account. The problems of
finding an optimal way of presenting pronunci-
ation to users are scrutinized and discussed.
Furthermore, so-called audio pronunciations,
i.e. recordings of read words in digital encyclo-
pedias, are treated from several points of view.
Introduction
When consulting encyclopedias, getting hold of
the pronunciation of an entry can be quite a
tricky affair, because of the multitude of phon-
etic notations being used in various works.
Some of these, such as the renowned Encyc-
lopædia Britannica, published in the U.S.A., do
not provide pronunciation data at all.
Among Swedish general encyclopedias,
there is a long-established tradition of explain-
ing to users how to pronounce words deemed
difficult. Customarily, this is done by way of
phonetic transcriptions; since the advent of dig-
ital media, sound recordings, "audio pro-
nunciations", have also been made use of.
Scope and method of study
This study, which is intended to lead to a Ph.D.
thesis, will focus on phonetic notation systems
used in encyclopedias, particularly those in the
Swedish language. The systems will be compared
with each other and also with some notations used
in pronunciation dictionaries (not only Swedish
ones) and evaluated. So-called audio pronun-
ciations, which are becoming increasingly fre-
quent in digital reference works as a comple-
ment to phonetic transcriptions, will be studied
separately.
The survey method is qualitative; by means
of questionnaires, a panel of encyclopedia users
will be consulted about their attitudes, expecta-
tions and preferences with regard to the display
of pronunciation.
An important issue will be that of optimiz-
ing pronunciation transcriptions. For someone
consulting a reference book, it takes a certain
effort to read up on the transcription system at
hand. This effort ought to be in proportion to
the benefit users get from it.
Other major issues that will be handled are
pronunciation editors' evaluation of sources of
information, their choice of language varieties,
"lects", to be transcribed/recorded, and their de-
cisions about what phonetic features (at pro-
sodic as well as segmental level) should be
provided in various types of works.
What pronunciation to display?
In the editing process of a reference book, an
essential issue, from several aspects, is de-
ciding what kind of pronunciation transcrip-
tions should be based on. For Swedish, a rough
consensus seems to exist, although some phon-
emes, like /r/ and its combinations with fol-
lowing dentals, may give rise to controversies.
In the Swedish language community, a
small one with a relatively high general level of
education, people are supposed to pronounce
many loanwords and foreign proper names in a
fairly source-language-like way.
For example, it would be stigmatizing to
pronounce pommes frites or Colgate as if they
were ordinary Swedish words: [pmsfrits],
[klgt]. Rather, the latter ought to be pro-
nounced in its conventional quasi-English way:
[klgt], and the former in the French way:
[pmfrit]. However, in recent decades, the el-
lipsis pommes, with the low-prestige pronunci-
ation [pms], has emerged.
In Sweden, a person with an academic edu-
cation not mastering at least one foreign lan-
guage can hardly be imagined. Prior to the ab-
olition of the studentexamen (roughly compar-
able to the French baccalauréat) in 1968, at
least two foreign languages were compulsory in
secondary schools. Overall, there is a social
pressure that creates a demand for information
about correct pronunciation in reference
books and dictionaries.
Choice of lects of foreign languages;
native vs. adapted pronunciation
Apart from settling what variety (accent) of
Swedish should be used as a base for transcrip-
tions, the problem of handling languages with
more than one major variety, e.g. English,
Spanish and Portuguese, is to be dealt with.
On the one hand, native speakers of these lan-
guages normally use their own pronunciation
wherever they are and whatever their topic is.
On the other hand, when presenting the name
of a living person, it is usually a matter of cour-
tesy for encyclopedias to report his/her own
preferred pronunciation. In general, imposing
on the bearer of a name a pronunciation totally
alien to him/her comes, like misspelling it,
close to being rude.
But as time goes by, frequently mentioned
names, even personal ones, usually undergo the
same pronunciation changes as loanwords do.
This is the case for Beethoven (see Table 1).
Notably enough, J.S. Bach keeps his German
pronunciation [bax] in Sweden, in spite of [x]
having no phonematic status in Swedish, but
has become [bak] in Denmark and [bk] or
[bx] in the U.K. In such cases, publishing in
the first place the adapted (swedicized) pronun-
ciation seems to be a good rule of thumb.
Table 1. IPA transcriptions of Beethoven as habitu-
ally pronounced in some languages. In the Danish
example, the apostrophe (') denotes stød.
German [bet!"#$%&
Swedish [b't()*+]
Danish [bet!",-./$&
British English [b01t2(34*+]
American English [b01t)4*+]
French [b0t)*+], [b0t*+]
Russian [b1tx)*+]
What notation to use?
The transcriptions in Swedish general reference
books published since the end of the 19th cen-
tury are of many different kinds: some printed
works, particularly older ones, employ only
letters of the Swedish alphabet, others add a
few special signs, and still others resort to a
more or less extensive IPA notation, not seldom
modified in some respects. In major Swedish
encyclopedias there is a tendency over the last
century to approach regular IPA, although a re-
luctance to accept the IPA way of marking
stress seems to remain. This may be due to a
solid tradition among monolingual Swedish
glossaries etc. to use an acute accent (´) to sim-
ultaneously indicate primary stress and quantity
(of vowels, in the first place). The acute accent
is not merely used in bracketed transcriptions,
but also in entry headwords. Since in Swedish,
quantity and stress are linked together, and
vowel and consonant length are in complement-
ary distribution, this system is economical and
operational as far as purely Swedish pronunci-
ation is concerned. The accent sign is placed
after the letter(s) representing the long sound,
e.g. kajak, konjak, pollett, thus eliminating
the problem of syllabification. However, when
it comes to rendering pronunciation of more
genuinely foreign words, the system proves to
be less suitable.
Table 2. Phonetic transcriptions of Fontainebleau
in the Swedish encyclopedias Nordisk Familjebok
(NF), Svensk Uppslagsbok (SvU), Nationalencyklo-
pedin (NE), Bertmarks Respons (BR) and Bonniers
Lexikon (BL). Years of publication in parentheses.
NF (1876-99) [f566t7+bl58]
NF (1904-26) [f)+gt7+9b:;8&
SvU (1947-55) [f5<t7+bl5=]
NE (1989-96) [#>t?$b:"@&
BR (1997-98) [#A6t?$b:"&
BL (1993-98) [#>t?$b:B&
Table 3. Phonetic transcriptions of Michelangelo
in the same works as in Table 2.
NF (1876-99) [mik0la=+CD0l)]
NF (1904-26) [mik0la=+CD0l5]
SvU (1947-55) [mikela+=j el]
NE (1989-96) [mik0la=+CE0l)]
BR (1997-98) [mik0la+CE0l)]
BL (1993-98) [mik0l?+CE0l)]
In addition, Den Store Danske Encyklopædi
(1994-2001) provides [mikelndlo], in its
notation based on the Dania system, as an
Italian pronunciation, but in the other case just
inserts an IPA stress mark in the headword:
Fontainebleau; seemingly, a certain familiar-
ity with French pronunciation is expected from
the users of this Danish 20-volume work.
Should prosodic features other than
stress be rendered?
Prosodic features other than stress, like accent 1
and accent 2 in Swedish and their equivalents in
Norwegian, or the Danish stød, are seldom ren-
dered in the notations of general reference books.
The reason for this may be twofold: the phenom-
ena mentioned are of minor importance for under-
standing an utterance, and their realization and
geographical spread vary widely. However, for
entries in Standard Chinese, it would be quite
feasible for a vast encyclopedia to supply the
four tones, as does for example Duden Aus-
sprachewörterbuch in its later editions.
Optimizing notations
Due to lack of space, a single-volume sports
dictionary cannot go into detail about pronunci-
ation in the way a full-fledged encyclopedia
can. Neither is it likely that its users are willing
to address themselves to a complicated system
in order to explore the minute details of a
word's pronunciation.
How narrow should a transcription be?
An encyclopedia, in contrast to a language dic-
tionary, might contain words from a great num-
ber of languages; for practical reasons, a com-
mon notation system must be used. Ideally, this
should be capable of conveying a phonemic
rendering of all the languages. This creates a
dilemma: if the transcription system is made
too narrow in order to fulfil the needs of one
language, in others it will necessitate irrelevant
choices between allophones.
Range of individual phonetic symbols
A compromise solution of the problem men-
tioned would be to widen the range assigned to
each phonetic symbol and use the signs some-
what differently for transcribing different lan-
guages. This requires some well-chosen ex-
amples in the introductory chapter but should
be a viable way of obtaining reasonably good
transcriptions of many of the original lan-
guages.
An alternative way would be to show re-
spelled pronunciations, in analogy with those
often found in U.S. reference works, even ex-
tensive ones. However, as mentioned above,
the linguistic situation in Sweden is quite dif-
ferent from that in the United States, where
strongly anglicized pronunciations of almost all
foreign words are widely accepted.
Respelled pronunciations
Interesting examples of respelled pronuncia-
tions are found in Olausson and Sangster
(2006) and its predecessor, the BBC Pronouncing dictionary of British names (1983). Here, the respelling systems are more condensed than their
U.S. counterparts and are presented together
with IPA transcriptions. This allows for conve-
nient use by a wide range of people. The re-
spelled pronunciations convey a rather angli-
cized version; the IPA transcriptions, aimed pri-
marily at users familiar with foreign languages,
are more true to the languages of origin, though
still somewhat anglicized.
Audio pronunciations
Encyclopedias that are web-based or on CD or DVD often offer users audio pronunciations (sound recordings of read entry headwords) as a complement to phonetic transcriptions. The production of such recordings brings some of the above issues to a head.
What languages to record?
It often proves practically impossible to make recordings in all languages figuring among the headwords of an extensive encyclopedia. Either a limited number of languages can be chosen for recordings by native speakers, or, if an adapted (e.g. Swedicized) pronunciation is used, a large number of languages, though rarely all, can be handled by one or two speakers.
How to choose speakers?
Whether native or domestic speakers are to be used, selecting them is a delicate task. Besides linguistic skill and a suitable voice, speaking style and age have to be taken into account. Even though a certain variation is desirable, the speakers must not be too disparate.
Coaching of speakers
When recording in a studio, speakers reading lists of words tend to use a prosody that reveals that they are (exactly!) reading a list, disregarding the fact that users will listen to each word as an isolated one. Coaching by a trained phonetician is advisable.
Conclusion
One of the main concerns of the editorial staff
of an encyclopedia or other reference book is
putting itself in the place of the notional users.
This applies not least to pronunciation editors and others responsible for displaying pronunciation. Hopefully, this survey, once completed, will make for useful and easily accessible pronunciation data for all those curious in search of it.
References
BBC Pronouncing dictionary of British names,
2nd edn. (1983), Pointon, G.E. (ed.). Ox-
ford: Oxford Univ. Press
Catford J.C. (1988) A practical introduction to
phonetics. Oxford: Oxford Univ. Press
Duden, Aussprachewörterbuch, 6th edn., re-
vised and updated (2005). Mannheim: Du-
denverlag
Garlén C. (2003) Svenska språknämndens uttalsordbok. Stockholm: Svenska språknämnden, Norstedts ordbok
International Phonetic Association (1999)
Handbook of the International Phonetic As-
sociation: guide to the use of the interna-
tional phonetic alphabet. Cambridge, U.K.:
Cambridge Univ. Press
Ladefoged P. and Maddieson I. (1996) The
sounds of the world's languages. Oxford:
Blackwell
Lindblad P. (1980) Svenskans sje- och tje-ljud i ett allmänfonetiskt perspektiv. Ph.D. thesis. Lund: C.W.K. Gleerup
Molbæk Hansen, P. (1990) Udtaleordbog. Gyldendals røde ordbøger: Dansk udtale. Copenhagen: Gyldendalske Boghandel and Nordisk Forlag A/S
Olausson, L. and Sangster, C. (2006) Oxford BBC Guide to pronunciation. Oxford: Oxford Univ. Press
Pullum G.K. and Ladusaw W.A. (1986 [2nd
edn. 1996]) The phonetic symbol guide.
Chicago: The Univ. of Chicago Press
Warnant L. (1994) La prononciation franaise
dans sa norme actuelle. Paris and Gem-
bloux, Belgium: Duculot
Wells J.C. (2008) Longman pronunciation dic-
tionary, 3rd edn. Harlow, U.K.: Pearson Ed-
ucation Ltd.
EXPROS: Tools for exploratory experimentation with
prosody
Joakim Gustafson and Jens Edlund
Centre for Speech Technology, KTH Stockholm, Sweden
Abstract
This demo paper presents EXPROS, a toolkit for
experimentation with prosody in diphone
voices. Although prosodic features play an im-
portant role in human-human spoken dialogue,
they are largely unexploited in current spoken
dialogue systems. The toolkit contains tools for
a number of purposes: for example extraction
of prosodic features such as pitch, intensity and
duration for transplantation onto synthetic ut-
terances and creation of purpose-built custom-
ized MBROLA mini-voices.
Introduction
This demo paper presents EXPROS, a graphical
toolkit permitting us to experiment with pro-
sodic variation in diphone synthesis in an effi-
cient manner.
Prosodic features such as pitch, intensity
and duration play an important role for many of
the aspects of spoken dialogue that are central
to human-human dialogue. Still, to date they
are rarely exploited in human-computer dia-
logues. Examples of areas that would benefit
from the inclusion of more prosodic knowledge
include interaction control, the management of
turn-taking, interruptions, and backchannels;
attitude towards what is said, such as the signal-
ling of uncertainty or certainty; prominence,
such as contrastive focus and stress; and
grounding, as in brief feedback utterances for
verification and clarification.
On the perception side, there is a fair body
of research into these matters from the spoken
dialogue system point of view. Some of these
results have been taken as far as to implementa-
tion and experimentation in full-blown spoken
dialogue systems. On the production side, there
are fewer examples where our knowledge of
prosody has made it all the way to full-blown
systems. In current spoken dialogue systems,
pre-recorded prompts or unit selection synthesis
are often chosen because of their superior voice
quality. The drawback is that these techniques
make it difficult to vary prosody and to control
this variation in any detail, so few examples of
experimentation with such variations exist. One
of the few examples is Raux & Black (2003),
which also provides an overview of the topic.
There is a large body of studies of prosodic fea-
tures using re-synthesis with modified prosody
(using e.g. Praat) and with HMM synthesis, but
the results have proven difficult to implement
in real on-line systems.
Other synthesis methods, formant synthesis and diphone synthesis, provide greater con-
trol over prosodic features. The relatively low
voice quality of formant synthesis makes it un-
suitable for many user studies, however, and
diphone synthesis suffers from the relatively
large cost of recording the required diphones, as
well as from less-than-perfect voice quality.
Before going into the functionality currently
built into the toolkit, let's discuss a few of its
applications. Our main reason to experiment
with prosodic variation is to make spoken di-
alogue systems that more closely mimic hu-
man-human dialogue, in order to better exploit
its strengths. This need not be the case for all
spoken dialogue system design, but it is our
motivation here. The following are three exam-
ples of increasing complexity of dialogue needs
that EXPROS aims to meet.
Interaction control
A key area where humans excel over current
spoken dialogue systems is interaction control,
the management of the flow of the dialogue, for
example turn-taking and interruptions. An oft-
mentioned problem is that of user barge-ins, but
we would also want our systems to be able to
deal with system barge-ins and self-interrupts in
a better manner. The dialogue excerpts in Ex-
ample 1 exemplify this. In order for a spoken
dialogue system to produce the behaviours
listed above, the systems processing in its en-
tirety needs to be incremental, as noted in Allen
et al. (2001) and Aist et al. (2006). Here, how-
ever, we are only concerned with being able to
control the rendering of the speech sounds suf-
ficiently to produce utterances like the ones in
Example 1.
Example 1: Three dialogue excerpts
U What's the weather like in Stockton?
S The weather in Stockholm? Wait a mo* [*ment, I'll look it up]
U No, I said Stockton
U Any news on fashion /SIL/ in Tibet?
S OK, le* [*t me see what I can do]
S Ah, let me see what I can do
U Any news about Camden market?
S Let me see... no, there's no* [*thing new at the moment]
/fresh news arrive/
S Oh, wait, there's a fire in Camden!
The sounds of dialogue
In order to achieve this kind of dialogue, we
need to be able to test variations in perception
tests as well as in real human-computer dia-
logue situations. To do this, we need to be able
to record the required prompts with different
prosody, at the very least. In many cases, we
may want to record new diphones; in the example above, for instance, we could record P*_SIL diphones, which go from a phoneme P to silence SIL abruptly, to make the interruptions
sound more realistic. Recording extra sets of
diphones for hypo- and hyper-articulated
speech may also be useful, as well as affective
speech, for example angry or despondent. Test-
ing out new voices can be very time consuming,
however, as a Swedish diphone voice typically
contains some 5000 diphones. This is far too
expensive for exploratory studies into the ef-
fects of prosodic and voice quality variations.
Instead, we can create mini-voices: voices with few diphones that are able to produce only a limited number of utterances, but that are easy to record and to modify.
Incremental development
Finally, pre-recorded prompts, unit selection
synthesis, and diphone synthesis all suffer from
the need to enrol the original speaker each time
the voice is to be extended or changed. A di-
phone voice production is furthermore often
created in one go, and rarely updated or
changed after its completion. We attempt to
make it possible for speakers who are not the
original speaker to do as many extensions as possible, particularly to record new prosodic patterns, and also for the voice creation to be
done incrementally, by making it simple to add
new diphones and diphone sets when they are
needed.
Prompts and voices developed in EXPROS
can be used in perception tests, either of stand-
alone prompts or of re-synthesised dialogue ut-
terances, but most importantly they are intended
for use in interactive experiments, where the pragmatics (the actual effect prosodic variation has on the interaction) can be measured.
The EXPROS Toolkit
The toolkit uses the Snack sound toolkit [1] as its backbone, and integrates functions from a number of existing tools, such as the MBROLA engine and database builder [2], a PC-KIMMO [3] morphological dictionary, NALIGN forced alignment (Sjölander & Heldner, 2004), /nailon/ prosodic extraction and normalisation (Edlund & Heldner, 2006), etc.
Text processing: Reading and management of
(prosodic) labels in the orthographic input.
These labels could be used to generate prosodic
patterns automatically, such as increased stress
or prolonged syllables.
Grapheme to phoneme conversion: The tool-
kit currently incorporates automatic transcrip-
tion using PC-KIMMO and a Swedish diction-
ary with transcribed morphs, an NALIGN
CART tree built on Centlex, a Swedish pronun-
ciation dictionary developed at the Centre of
Speech Technology, as well as a set of coarticulation rules (over word boundaries) built
[1] http://www.speech.kth.se/snack/
[2] http://tcts.fpms.ac.be/synthesis/mbrola.html
[3] http://www.sil.org/pckimmo/
into NALIGN. In addition, user lexica can be
defined and used.
Automatic speech alignment: The toolkit uses
the forced aligner NALIGN to extract phone
start and end times from recordings.
Automatic prosody parameter extraction:
For prosodic analysis, the toolkit can currently
use the methods built into the Snack sound
toolkit (ESPS get_f0 and AMDF pitch extrac-
tion as well as power analysis, which can be
used to estimate spectral tilt). The normaliza-
tion methods built into /nailon/ are also avail-
able.
Modification of prosodic parameters: The
toolkit provides a number of methods for modi-
fication of prosodic parameter curves as well as
creation of new curves. These include direct
manipulation in a GUI, stylisation, normalisa-
tion and transformation to another speaker's speaking style, model-generated prosodic
curves, and transplantation of curves from re-
cordings.
Diphone synthesis: The toolkit uses an ex-
tended MBROLA synthesis engine (Drioli et
al., 2005), which adds control of, for example,
gain, spectral tilt, shimmer and jitter to render
audio. Using a combination of the components
listed above, the toolkit also gives the possibil-
ity to automatically generate the data needed to
build new MBROLA diphone databases, and
some scripts to make on-the-fly modifications
to how the MBROLA engine selects diphones.
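To make the data format concrete: standard MBROLA input is a plain-text .pho file in which each line gives a phone label, its duration in milliseconds, and optional (position-in-percent, F0-in-Hz) pitch points. The sketch below writes such a file from Python; the phone labels, durations and pitch values are invented for illustration, and the extended voice-quality parameters mentioned above (gain, spectral tilt, shimmer, jitter) are engine extensions, not part of this standard format.

```python
def write_pho(path, phones):
    """Write a phone sequence in MBROLA's .pho input format.

    Each entry is (phone, duration_ms, targets), where targets is a list
    of (position_percent, f0_hz) pitch points. In a transplantation
    setting, the F0 values would be measured on a human recording at
    the corresponding time points.
    """
    with open(path, "w") as f:
        for phone, dur_ms, targets in phones:
            parts = [phone, str(dur_ms)]
            for pos, f0 in targets:
                parts += [str(pos), str(round(f0))]
            f.write(" ".join(parts) + "\n")

# Hypothetical utterance fragment with a rising contour on the vowel:
write_pho("hello.pho", [
    ("_", 100, []),                      # leading silence
    ("h", 60, []),
    ("e", 140, [(10, 110), (90, 140)]),  # rise from 110 Hz to 140 Hz
    ("l", 70, []),
    ("@U", 180, [(50, 120)]),
    ("_", 100, []),                      # trailing silence
])
```

The resulting file can then be passed to the MBROLA binary together with a suitable diphone database (typically `mbrola <voice> hello.pho hello.wav`).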
Next steps
A number of experiments and investigations
using EXPROS are underway:
We will test the effects of transplanted
prosody on perceived synthesis quality. Pre-
liminary listening tests suggest that transplant-
ing durations, intensity and pitch from human
recordings onto the diphone synthesis makes
diphone voices sound considerably better as a
whole, which is promising. We also want to test
this in the context of the findings of Hjalmars-
son & Edlund (in press), where synthesised ut-
terances containing typical features of human-
human dialogue, such as filled pauses and repe-
titions, were investigated.
The EXPROS tool has recently been used to
improve the subjective ratings of a bad speaker,
by re-synthesising 30 seconds of speech with
increased pitch variation and speaking rate
(Strangert & Gustafson, submitted). We intend
to do more experiments with resynthesis in or-
der to explore the limits of what can be ex-
pressed by manipulating prosody alone.
Furthermore, the toolkit has proven valuable for verifying the quality of automatic prosodic analysis (pitch and intensity extraction as well as phone durations) by listening to the original recording and its resynthesis in parallel, a method inspired by Malfrère & Dutoit (1997).
Finally, we are in the process of running
tests where subjects use EXPROS to create new
versions of very brief feedback or clarification
utterances in order to change their meaning. We
have previously shown that monosyllabic words
can be understood as positive or negative
grounding on the perceptual or understanding
levels by manipulating their pitch contour
(Edlund et al., 2005), and using EXPROS, we
hope to be able to show the same for multisyl-
labic compound words.
Acknowledgements
Thanks to everyone who has put hard work into developing the publicly available tools that
are used in this toolkit. Special thanks to
Thierry Dutoit (MBROLA) and Kåre Sjölander
(Snack/NALIGN). This work was supported by
the Swedish research council project #2006-
2172 (Vad gör tal till samtal/What makes
speech special) and MonAMI, an Integrated
Project under the EC's Sixth Framework
Program (IP-035147).
References
Aist, G., Allen, J. F., Campana, E., Galescu, L.,
Gómez Gallo, C. A., Stoness, S. C., Swift,
M., & Tanenhaus, M. (2006). Software Ar-
chitectures for Incremental Understanding
of Human Speech. In Proceedings of Inters-
peech (pp. 1922-1925). Pittsburgh PA,
USA.
Allen, J. F., Ferguson, G., & Stent, A. (2001).
An architecture for more realistic conversa-
tional systems. In Proceedings of the 6th in-
ternational conference on Intelligent user
interfaces (pp. 1-8).
Drioli, C., Tesser, F., Tisato, G., & Cosi, P.
(2005). Control of voice quality for emo-
tional speech synthesis. In Proceedings of
AISV 2004 (pp. 789-798). Padova, Italy.
Edlund, J., & Heldner, M. (2006). /nailon/ -
software for online analysis of prosody. In
Proc of Interspeech 2006 ICSLP. Pittsburgh
PA, USA.
Edlund, J., House, D., & Skantze, G. (2005).
The effects of prosodic features on the in-
terpretation of clarification ellipses. In Pro-
ceedings of Interspeech 2005 (pp. 2389-
2392). Lisbon, Portugal.
Hjalmarsson, A., & Edlund, J. (in press). Hu-
man-likeness in utterance generation: ef-
fects of variability. To be published in Pro-
ceedings of the 4th IEEE Workshop on Per-
ception and Interactive Technologies for
Speech-Based Systems. Kloster Irsee, Ger-
many.
Malfrère, F., & Dutoit, T. (1997). Speech Synthesis for Text-to-Speech Alignment and Prosodic Feature Extraction. In Proceedings of the International Symposium on Circuits and Systems (pp. 2637-2640).
Raux, A., & Black, A. W. (2003). A Unit Selection Approach to F0 Modeling and its Application to Emphasis. In Proceedings of ASRU 2003, St Thomas, US Virgin Islands.
Sjölander, K., & Heldner, M. (2004). Word
level precision of the NALIGN automatic
segmentation algorithm. In Proc of The
XVIIth Swedish Phonetics Conference, Fo-
netik 2004 (pp. 116-119). Stockholm Uni-
versity.
Strangert, E., & Gustafson, J. (submitted). Sub-
ject ratings, acoustic measurements and syn-
thesis of good-speaker characteristics. Sub-
mitted to Proceedings of Interspeech 2008.
Brisbane, Australia.
Presenting in English or Swedish:
Differences in speaking rate
Rebecca Hincks
Department of Speech, Music and Hearing
KTH
Abstract
This paper attempts to quantify differences in
speaking rates in first and second languages, in
the context of the growth of English as a lingua
franca, where more L2 speakers than ever be-
fore are using English to perform tasks in their
working environments. One such task is the
oral presentation. The subjects in this study
were fourteen fluent English second language
speakers who held the same oral presentation
twice, once in English and once in their native
Swedish. The temporal variables of phrase
length (mean length of runs in syllables) and
speaking rate in syllables per second were cal-
culated for each language. Speaking rate was
found to be 23% slower when using the second
language, and phrase length was found to be
24% shorter.
Introduction
As English continues its growth as a lingua
franca, more and more speakers across the
world find themselves in front of an audience
that needs to hear the speaker's message in a
language that neither speaker nor listener is en-
tirely comfortable with. One reason for the dis-
comfort can be traced to the extra time it takes
to formulate one's message in a second lan-
guage (L2). Slower English speakers in busi-
ness meetings can have difficulty taking the
floor from native speakers (Rogerson-Revell,
2007) and international students may be frus-
trated by their ability to formulate responses
quickly enough to contribute to classroom dis-
cussion (Jones, 1999). Though researchers have
begun to explore the effect of L2 language use
in interactive situations such as the meeting or
the seminar, the ramifications of slower L2
speaking rate when holding an instructional
monologue, such as a presentation or a lecture,
have not been explored.
Conveying information to an audience in an
L2 can be a difficult experience for many rea-
sons. Teachers complain that they are less able
to be spontaneous, but they may not understand
themselves that they require a bit more time to
produce the same linguistic content. In general,
little is known about how the use of a second
language affects vital issues such as the
speaker's ability to engage the audience and to
adequately cover the intended content in the
time allotted for the presentation or lecture.
Temporal features, particularly speaking rate, can have an influence on both abilities.
Temporal variables have previously been
explored from the L1 perspective, the L2 per-
spective, and various interfaces between them.
The work that has been done has been carried
out in an attempt to study the cognitive proc-
esses underlying linguistic production (Gold-
man-Eisler, 1968), to understand language ty-
pology (Grosjean & Deschamps, 1973), to sup-
port a theoretical model for the process of sec-
ond language acquisition (Towell, Hawkins, &
Bazergui, 1996) or for tools in language as-
sessment (Rekart & Dunkel, 1992). The pre-
sent study is motivated by other needs that
could be described as pragmatic rather than
theoretical. We are now in a situation, at least
in Europe, where more speakers than ever be-
fore are carrying out their daily business in a
second language, English. The fact that speak-
ers speak more slowly in a second language
may be obvious but it is not trivial in the glob-
alizing world. The question asked here is sim-
ply how much speakers can be slowed down by
working in a second language.
This research builds on earlier work
(Hincks, 2005a and 2005b) which examined a
smaller database of five speakers making dual
lingual presentations. Those five speakers form
part of this study as well, but their recordings
have been augmented with nine new speakers
to create a more reliable subject group. The
first study looked at two primary variables:
speaking rate and pitch variation. The present
study omits pitch variation, saving that aspect
for a future study. The 2005 results showed
large differences in speaking rate, which have
been confirmed by testing on a larger group.
Method
Working at the syllable rather than word level
is a necessity for any kind of cross-linguistic
study; although Swedish and English are
closely related languages, they use different
orthographic conventions. An assumption is
made that the information content of syllables
is equivalent when comparing genetically re-
lated languages such as English and Swedish.
This study uses the second rather than the minute as the unit of time, and speaking rate (SR) is thus expressed in syllables per second (sps).
Another variable that has been found to be
relevant in the study of speaking rate is what is known as the mean length of runs (MLR), what could also be called phrase length: the amount of speech, in syllables, between pauses.
The MLR will generally be shorter in L2
speech than in L1 speech (Kormos and Dnes
2004), and in that way give an indication of the
frequency of pauses in the speech. Different
pause lengths have been used to define the
boundaries of the phrases, but most studies
have used a length between 200 and 300 milli-
seconds. This study uses a length of 250ms, or
one quarter of a second.
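As an illustrative sketch (not the study's actual tooling), the grouping of syllables into runs under the 250 ms criterion can be expressed in a few lines; the syllable timings below are hypothetical:

```python
PAUSE_THRESHOLD = 0.250  # seconds; the phrase-boundary criterion used here

def split_into_runs(syllables, threshold=PAUSE_THRESHOLD):
    """Group (start_s, end_s) syllable times into runs, starting a new
    run wherever the gap between consecutive syllables exceeds the
    pause threshold."""
    runs, current, prev_end = [], [], None
    for start, end in syllables:
        if prev_end is not None and start - prev_end > threshold:
            runs.append(current)
            current = []
        current.append((start, end))
        prev_end = end
    if current:
        runs.append(current)
    return runs

# Hypothetical timings: the 0.4 s gap before the third syllable
# splits the sequence into a 2-syllable run and a 1-syllable run.
sylls = [(0.0, 0.2), (0.2, 0.4), (0.8, 1.0)]
print([len(run) for run in split_into_runs(sylls)])  # [2, 1]
```

The mean length of runs is then simply the average of these run lengths.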
The fourteen subjects for this study, six
women and eight men, were all Swedish native
students of engineering at KTH, taking an elec-
tive course in Technical English. They had
taken a written diagnostic test upon application
to the language department, and had been
placed in either the Upper Intermediate (B2+)
(10 subjects) or Advanced classes (C1) (4 sub-
jects). The English oral presentations were re-
corded in the second half of the 56-hour
courses, so that students had had plenty of time
to warm up any rusty spoken English. The
Swedish oral presentations were made outside
of class, using the same visual material and be-
fore a smaller audience.
The 28 presentations were carefully tran-
scribed in a three-step process. First the entire
presentation was orthographically transcribed,
including filled pauses. Speech recognition was
a helpful tool in the English transcriptions. The
speaker-dependent dictation software Dragon NatSpeak 9 was trained on the researcher's voice; the researcher then repeated the presentations into the dictation program. A complete, though
somewhat inaccurate, transcription could be
produced in real time: 10 minutes for a 10-
minute presentation. Listening to the presenta-
tion two or three more times allowed for cor-
rection of the inaccuracies and addition of the
filled pauses that the speech recognition is
trained to ignore. The vocabulary of the dicta-
tion software was impressive, including Swed-
ish place names and rare words such as types of
pharmaceuticals and phenomena (e.g. quantum
teleportation). Swedish personal names were,
however, more problematic.
The second phase of transcription, which allowed further correction of any remaining inaccuracies, was to break the transcriptions into
phrases, using pauses as boundaries. The
speech waveform was used to locate all silent
or filled pauses longer than 250 milliseconds.
Finally, in the third phase of transcription,
each phrase was broken into syllables by insert-
ing spaces to represent syllable boundaries.
Filled pause markings were first removed so
that they would not be counted as syllables.
The total number of syllables was divided by
the length of the presentation in seconds to find
the speaking rate.
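The calculation just described reduces to simple arithmetic. As a sketch, the following spot-checks the speaking rate of one speaker (S1, Swedish L1) against the totals reported in Table 1 below:

```python
def speaking_rate(total_syllables, total_seconds):
    """Speaking rate in syllables per second (sps), pauses included."""
    return total_syllables / total_seconds

# Speaker S1, Swedish L1 (Table 1): 2244 syllables in 588 seconds.
print(round(speaking_rate(2244, 588), 2))  # 3.82

# Mean phrase-length difference reported in the results:
# 12.59 - 9.51 syllables per phrase.
print(round(12.59 - 9.51, 2))  # 3.08
```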
Results
Table 1 presents the mean length of runs, the
total number of syllables, the total speaking
time, and speaking rate.
Phrase length (MLR)
All speakers used shorter phrase lengths in
English than in Swedish. The means were
12.59 syllables per phrase in L1 and 9.51 sylla-
bles per phrase in L2, a mean difference of
3.08, SD 2.15. This shorter length in L2 is sta-
tistically significant: t (13) =3.10, p <.01,
one-tailed. The phrase lengths by speaker cor-
relate strongly between languages: R=0.82.
Speaking rate
The mean SR for L1 was 3.89 sps (SD .61), and
in L2 3.12 sps (SD .46). The slower speaking
rate in L2 is statistically significant: t (13) =
3.438, p <.01, two tailed. This can also be ex-
pressed as a mean difference of 20.8%, where
L2 is 23% slower than L1, and L1 is 18.7%
faster than L2. As expected, all speakers spoke
more quickly in L1: at least 3 sps, with three
speakers approaching a speaking rate of 5 sps.
In L2 the rates range from a low of 2.37 sps to
a high of 4.12 sps. The SRs between languages
correlate strongly, R=0.85. They also correlate
by speaker with phrase length: 0.82 for L1, and
0.89 for L2.
Table 1. The mean length of runs between pauses of >250 ms, the total number of syllables in the
presentation, the total seconds of speech, and the speaking rate in syllables per second.
Speaker | MLR (syllables): Sw L1 / En L2 | Total syllables: Sw L1 / En L2 | Total time (s): Sw L1 / En L2 | Speaking rate (syll/s): Sw L1 / En L2
S1(M) 11.57 6.40 2244 2272 588 836 3.82 2.72
S2(M) 10.98 6.94 1383 1534 439 648 3.15 2.37
S3(M) 8.94 7.09 1225 1318 370 475 3.31 2.77
S4(F) 10.55 8.00 1889 1568 535 568 3.53 2.76
S5(M) 11.49 8.23 2367 1844 545 576 4.34 3.20
S6(F) 8.84 8.45 1538 1369 477 472 3.22 2.90
S7(M) 9.27 9.14 1660 1773 506 597 3.28 2.97
S8(M) 11.14 9.23 1437 1938 355 580 4.05 3.34
S9(M) 11.37 9.87 2411 1797 602 521 4.00 3.45
S10(M) 14.79 10.36 3934 2487 840 700 4.68 3.55
S11(F) 14.26 10.99 2737 2571 676 752 4.05 3.42
S12(F) 12.96 11.16 1361 1296 411 425 3.31 3.05
S13(F) 20.36 12 3502 2519 722 696 4.85 3.62
S14(F) 19.73 15.34 2229 2025 464 491 4.80 4.12
SD 3.62 2.38 820 446 138 119 .61 .46
Mean 12.59 9.51 2137 1879 538 596 3.89 3.12
Discussion
Both similarities and differences between the
L1 and L2 presentations have been revealed by
this examination of two temporal variables: the
amount of speech uttered between pauses
(MLR), and the speaking rate (SR), including
pauses, over 6-14 minutes. To begin with the
similarities, there is a strong effect of individ-
ual speaking style between the two languages.
The correlations between L1 and L2 of 0.85 (SR) and 0.82 (MLR) show that those speakers
who used shorter phrase lengths and slower
rates of speech in one language used them in
the other language as well, confirming previous
work done on laboratory speech (Deschamps,
1980; Raupach, 1980; Towell, Hawkins, &
Bazergui, 1996). Though other researchers
have suggested using phrase length to measure
fluency in second languages, it is important to
recognize that phrase length differs in one's
first language as well.
The main research issue addressed here was
an attempt to quantify the effect on speaking
rate of using an L2 in the oral presentation
situation. Using English instead of their native
language meant that all speakers had shorter
phrase lengths and slower rates of speech. On
average, using English slowed the speakers
down by 23%. The difference can be attributed
to the frequent short pauses, as evidenced by the shorter phrase lengths, that are necessary
for the speakers to find the formulations they
need in L2. A long phrase length shows that the linguistic knowledge has been proceduralized (Levelt, 1989; Towell, Hawkins, & Bazergui, 1996). The subjects in this study, though
they were speaking about material they them-
selves had prepared and were fluent speakers of
English, show the degree to which operating in
a second language affects the cognitive proc-
esses underlying speech production.
Conclusion
Recommendations
The slower speaking rates shown in this study
do not necessarily pose a problem when the
speech in question is instructional speech.
When both speakers and listeners are operating
in a second language, a speaking rate of about 3
sps is probably appropriate. However, it is im-
portant for individual speakers and for policy-
makers to understand and acknowledge the ef-
fect of using a second language on speaking
rate, particularly when making a shift from do-
ing a task one normally does in L1 to doing it
in L2. If the rate of delivery of a 45-minute
lecture is slowed down by 25%, then the lecture
will take closer to an hour to finish. Course
plans and schedules need to be adapted to ac-
commodate this, especially in light of the fact
that research has shown that students tend to
save their questions for after class when they
are themselves operating in an L2 (Airey &
Linder, 2006). Other measures that could be
considered would include variable speaker time
at conferences and other gatherings.
Further work
The next question to be asked in the study of
the dual-language presentation database is to
what extent using different languages affected
the content of the presentations. Is the slower
speaking rate a symptom of such linguistic dif-
ficulty that speakers omit information in L2
that they include in L1? It is beyond the scope
of the present paper to investigate this question
in detail, but it can be said that an initial study
comparing the propositional content of the
fourteen pairs of presentations finds a slight but
not overwhelming advantage for the L1, espe-
cially when the presentations are normalized
for time. Further differences appear in the
meta-discourse with which the speakers struc-
ture their presentations, and the extent to which
they elaborate on the content. These issues will
be the subject of forthcoming work.
References
Airey, J. & Linder, C. (2006). Language and the experience of learning university physics in Sweden. European Journal of Physics 27, 553-560.
Hincks, R. (2005a). Presenting in English and
Swedish. Proceedings of Fonetik 2005
(Gothenburg University Department of Lin-
guistics) 45-48.
Hincks, R. (2005b). Computer Support for
Learners of Spoken English. Doctoral The-
sis. Royal Institute of Technology Stock-
holm.
Deschamps, A. (1980). The syntactical distribu-
tion of pauses in English spoken as a second
language by French students. In Temporal
Variables in Speech (pp. 255-262): Mouton.
Goldman-Eisler, F. (1968). Psycholinguistics.
Experiments in Spontaneous Speech. Lon-
don: Academic Press.
Grosjean, F., & Deschamps, A. (1973). Analyse des variables temporelles du français spontané II. Comparaison du français oral dans la description avec l'anglais (description) et avec le français (interview radiophonique). Phonetica 28, 191-226.
J ones, J . (1999). From Silence to Talk: Cross-
Cultural Ideas on Students' Participation in
Academic Group Discussion. English for
Specific Purposes, 18(3), 243-259.
Kormos, J ., & Dnes, M. (2004). Exploring
measures and perceptions of fluency in the
speech of second language learners. System,
32, 145-164.
Levelt, W. (1989). Speaking: from Intention to
Articulation. Cambridge: MIT Press.
Raupach, M. (1980). Temporal variables in first
and second language speech production. In
Temporal Variables in Speech (pp. 263-
270): Mouton.
Rekart, D., & Dunkel, P. (1992). The Utility of
Objective (Computer) Measures of the Flu-
ency of English as a Second Language. Ap-
plied Language Learning, 3, 65-85.
Rogerson-Revell, P. (2007). Using English for
International Business: A European case
study. English for Specific Purposes, 26,
103-120.
Towell, R., Hawkins, R., & Bazergui, N.
(1996). The Development of Fluency in
Advanced Learners of French. Applied Lin-
guistics, 17(1), 84-119.
Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg
Preaspiration and Perceived Vowel Duration in Norwegian
Jacques Koreman¹, William J. Barry² and Marte Kristine Lindseth¹
¹ Department of Language and Communication Studies, NTNU, Trondheim
² Institute of Phonetics, Saarland University, Saarbrücken
Abstract
This article presents an experiment to investi-
gate the perceived duration of Norwegian vow-
els before [d] versus preaspirated [t]. It is
shown that preaspiration contributes to the
perceived duration of the vowel before [t]. The
general implications of this finding for phone
segmentation and for phonetic research using
vowel duration measurements are discussed.
Introduction
In a production study of American English,
Peterson and Lehiste (1960) found that vowel
duration including aspiration after phonologi-
cally voiceless or fortis plosives (308 ms) is
longer than after phonologically voiced or lenis
word-initial plosives (274 ms); if aspiration is
excluded, it is shorter (251 ms) on average in a
set of 68 minimal pairs. In a perception study
for German (results as yet unpublished), a cor-
responding effect on perceived duration was
shown: the vowel duration after a lenis plosive
was judged equal to the duration of the vowel
plus half of the aspiration after a fortis plosive.
Comparable to the production data above,
the vowel incl. preaspiration before a tense plo-
sive is longer than the vowel before a lax plo-
sive in preaspirating dialects of Norwegian
(Van Dommelen, 1999), while the vowel ex-
cluding preaspiration is shorter; but there is
substantial variation across dialects. In percep-
tion, the perceptual effect of preaspiration on
phonological categorization has been investi-
gated by Moxness (1997), who found that the
perceived (phonological) vowel length is not
affected by the presence or absence of preaspi-
ration in V:C versus VC: stimuli. Van Dom-
melen (1998) showed an effect of preaspiration
on the perception of plosives as fortis versus
lenis in Norwegian.
The present study investigates the perceived
phonetic vowel duration in stimuli containing
preaspirated versus fully voiced vowels. We
hypothesize that, similar to aspiration, preaspi-
ration influences the perceived vowel duration.
More specifically, our goal is to evaluate how
much of the preaspiration is perceived as part
of the vowel.
Method
We shall first discuss the selection of the stimu-
lus pairs and then describe their preparation for
the perception experiment, followed by a de-
scription of the experiment itself.
Selection of the stimuli
Ten repetitions of two sets of /CVC/ stimuli,
with short and long vowels, were recorded
with all combinations of /i,a,u/ followed by
/p,b; t,d; k,g/ (and with the same initial C in
each minimal pair differing in [voice] for the
second consonant).
The stimuli were presented for reading in ran-
dom order on a computer screen using a
PowerPoint presentation, and were recorded
directly onto hard-disk in a studio, with a sam-
pling frequency of 44 kHz and a 16-bit ampli-
tude resolution. Comparison of the stimuli
showed that preaspiration after short vowels is
generally longer than after long vowels. The
vowel // showed no supraglottal friction,
which did sometimes occur with close vowels,
especially /i/. To maximize the presence of true
preaspiration, we selected batte-badde
from the list as the single stimulus pair for our
perception experiment. An additional consid-
eration was that these are both nonsense words,
so that the listeners are not affected by familiar-
ity or frequency of the stimuli.
Figure 1. Segmentation of the vowel in "batte" into
modally voiced (modal), breathy voiced (br) and
preaspirated (preasp) portions.
To investigate how perceived vowel dura-
tion is affected by preaspiration, we used ten
repetitions of the stimuli spoken by a single,
male speaker from Stavanger. Figure 1 shows
an example of the segmentation of the vowel in
batte into modally voiced, breathy voiced and
preaspirated portions, for which praat was
used. Since there was typically a breathy
voiced signal portion in the transition from mo-
dal voice to preaspiration, this was included as
a separate factor in the perception experiment.
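Such a three-way segmentation translates directly into portion durations. A minimal sketch, assuming the interval boundaries have been exported from the Praat TextGrid as (label, start, end) triples with the tier labels of Figure 1; the times below are illustrative, not measured values:

```python
# Compute durations (in ms) of the labelled vowel portions of "batte".
# Intervals are (label, start_s, end_s) triples, e.g. exported from a
# Praat TextGrid; labels follow Figure 1. The times are made up.

def portion_durations(intervals):
    """Map each labelled portion to its duration in milliseconds."""
    return {label: (end - start) * 1000.0 for label, start, end in intervals}

batte_vowel = [("modal", 0.112, 0.178),
               ("breathy", 0.178, 0.203),
               ("preasp", 0.203, 0.241)]
durations = portion_durations(batte_vowel)
```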
Preparation of the stimuli
Two native Norwegian listeners (both MA stu-
dents of Phonetics) could not distinguish the
signal portions of the two sets of stimulus
words from the release to the end of the word.
Thus, batte and badde are differentiated by
the stressed vowel and the following closure.
The vowel of the nonsense words of the
batte type all consisted of modal voicing, fol-
lowed by breathy voicing and preaspiration.
The vowel in badde consisted entirely of mo-
dal voicing which continued into the closure of
the following /d/. The stimuli for the listening
test were adapted so that the durations of the
vocalic portions were carefully controlled:
Figure 2. Schematic diagram of the vowel du-
rations in six stimulus conditions (time axes in
the real stimuli are not comparable across
stimulus conditions)
Stimulus 0: the duration of the vowel in
"badde" equals that of the modally voiced +
breathy voiced + preaspirated portions of the
vowel in "batte".
All stimulus pairs were selected (from the ten
repetitions) such that the two vowels in a pair
were similar.
stimuli were changed by deleting or adding sin-
gle glottal periods until the vocalic portions of
interest had more or less the same durations.
The following stimuli have increasingly
longer vowels (from vowel onset until the clo-
sure of the following consonant) in batte than
in badde:
Stimulus 1: the duration of the vowel in
badde equals that of the modally voiced +
breathy voiced + half of the preaspirated por-
tion in batte.
Stimulus 2: the duration of the vowel in
badde equals that of the modally voiced +
breathy voiced portions in batte.
Stimulus 3: the duration of the vowel in
badde equals that of the modally voiced por-
tion in batte.
Notice that in the last three stimuli, the
batte stimulus word will be longer than
badde, namely by the other half of the
preaspirated portion (which is still present!) in
stimulus 1, the whole preaspirated portion in
stimulus 2, and the breathy voiced +preaspi-
rated portions in stimulus 3.
In addition to the above three conditions
there are two conditions in which the vowel in
batte is relatively shorter:
Stimulus -1: the duration of the vowel in
badde is longer than the total vowel duration
in batte by the duration of the breathy voiced
portion in batte (but the vowel is modally
voiced throughout).
Stimulus -2: the duration of the vowel in
badde is longer than the total vowel duration
in batte by the duration of the breathy voiced
+ half of the preaspirated portion in batte
(but the vowel is modally voiced throughout).
We did not include a condition stimulus -3
(same as stimulus -2, but with an even longer
vowel in badde) because the vowel in that
case was always perceived as longer than the
total vowel in batte in preliminary listening.
Inclusion would have increased the number of
stimuli, without giving additional information.
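The six conditions reduce to simple sums over the measured "batte" portions. A sketch of the target duration of the "badde" vowel in each condition (the portion durations in ms are hypothetical, and the function name is ours):

```python
def badde_target(condition, modal, breathy, preasp):
    """Target duration of the "badde" vowel, given the durations of the
    modally voiced, breathy voiced and preaspirated portions of "batte"."""
    total = modal + breathy + preasp
    return {
        0:  total,                         # equal total vowel durations
        1:  modal + breathy + preasp / 2,  # "batte" longer by half the preaspiration
        2:  modal + breathy,               # "batte" longer by the preaspiration
        3:  modal,                         # "batte" longer by breathy voice + preaspiration
        -1: total + breathy,               # "badde" longer by the breathy portion
        -2: total + breathy + preasp / 2,  # "badde" longer by breathy + half preaspiration
    }[condition]
```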
Two stimulus sets were prepared: in set A, a
stimulus pair was selected for each condition
from the ten different realizations of batte
and badde which fulfilled its vowel duration
criteria (cf. Figure 2). The advantage of this
stimulus set is that the stimuli were produced
naturally, i.e. the (majority of the) stimuli did
not need to be manipulated (by dropping or
copying glottal periods) to obtain the vowel du-
rations according to the scheme in Figure 2.
The disadvantage of those stimuli is that in ad-
dition to the durational differences there may
be other factors which influence the perception
of vowel duration.
For this reason, we also created a set B, in
which only one stimulus pair was selected as a
basis for the perception experiment. The pair
was chosen so as to fulfill condition 0, i.e. the
vowel in badde had the same duration as the
total vowel duration in batte. To derive the
other conditions, the stimuli were manipulated
by inserting or deleting glottal periods from the
modally voiced portion of the vowel. The
batte and badde stimuli were manipulated
equally strongly (e.g. two glottal periods in-
serted into badde and two dropped from
batte for condition -1).
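Manipulating vowel duration by whole glottal periods avoids waveform discontinuities at the splice points. A NumPy sketch, assuming the period onsets are known as sample indices (e.g. from pitch marks); the function names are ours:

```python
import numpy as np

def drop_periods(signal, onsets, n):
    """Shorten a vowel by removing the n glottal periods that end at the
    last marked period onset; `onsets` are sample indices, increasing."""
    start, end = onsets[-(n + 1)], onsets[-1]
    return np.concatenate([signal[:start], signal[end:]])

def copy_periods(signal, onsets, n):
    """Lengthen a vowel by duplicating those same n glottal periods."""
    start, end = onsets[-(n + 1)], onsets[-1]
    return np.concatenate([signal[:end], signal[start:end], signal[end:]])
```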
Perception experiment
Eight listeners listened to stimulus set A. The
same listeners, plus another four, listened to
stimulus set B. The listeners were all native
Norwegians between 25 and 60 years old, and
had no reported hearing problems.
The listeners' task was to judge which of
the two stimuli in each pair was longer, and to
respond by checking the corresponding box on
a response form (on paper). There was no
"equal duration" choice, since we wanted to
prevent the listeners from
using this option too often. The listeners could
hear the stimulus pair over headphones as often
as they liked by moving the mouse over a loud-
speaker symbol in the PowerPoint presentation.
"Mouse over" was used instead of "mouse
click" to avoid the disturbing sound of
mouse clicks.
The batte-badde pairs were offered in
both orders, balanced across ten lists (repeti-
tions). Within each list, the stimulus pairs were
offered 10 times in different pseudorandomized
order. The pseudorandomization consisted in
ensuring that two consecutive stimulus
conditions were at least 2 apart, i.e. Stimulus 1
could not be followed by Stimulus 0 or 2. The
total number of stimuli was 6 conditions x 2
orders x 10 repetitions =120 stimuli. Each list
was preceded by five and followed by three
filler items.
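The ordering constraint can be enforced by re-shuffling until no two adjacent conditions are closer than two steps. A sketch (the constant and function names are ours):

```python
import random

CONDITIONS = [-2, -1, 0, 1, 2, 3]

def pseudorandom_order(conditions, min_gap=2, seed=None):
    """Shuffle the conditions until every pair of consecutive conditions
    differs by at least min_gap, so that e.g. Stimulus 1 is never
    immediately followed by Stimulus 0 or 2."""
    rng = random.Random(seed)
    order = list(conditions)
    while True:
        rng.shuffle(order)
        if all(abs(a - b) >= min_gap for a, b in zip(order, order[1:])):
            return order
```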
Results
The two versions of the experiment using
stimulus sets A and B led to substantially dif-
ferent results (see Figure 3). In general, the
upward trend in longer vowel responses from
Stimulus -2 to Stimulus 3 corroborates our hy-
pothesis. But for set A, the trend is weak and
the number of responses where the vowel in
batte is considered longer than that in
"badde" never exceeds 50%. That is, the vowels
in the "badde" stimuli are predominantly judged
longer than those in "batte".
For set B, where the stimuli were all de-
rived from a single pair of stimuli, the trend is
much clearer, and shows a clear transition from
3% longer vowel responses for Stimulus -2 to
almost 94% for Stimulus 3.
Figure 3. Percentage of "longer vowel in batte"
responses for stimulus sets A and B.
Differences between stimulus conditions
Separate analyses of variance for the two
stimulus sets showed that order of the stimuli
within the pair had no significant effect on the
perceived relative vowel duration in the two
stimuli, nor did order interact with stimulus
condition. The difference between the six
stimulus conditions, however, was highly sig-
nificant for both stimulus sets (set A:
F(0.001;5,84) = 10.21; set B:
F(0.001;5,132) = 146.83).
For set A, Scheffé's post-hoc tests resulted
in three homogeneous subgroups (-1,-2,1,0 <
0,3 < 3,2), where the middle subgroup is almost
significant at 5%. The tendency is therefore
that the perceived vowel duration in stimulus
conditions 2 and 3 differs from the other condi-
tions. But the vowel in batte is mostly per-
ceived as shorter than that in badde, as noted
at the beginning of this section.
For stimulus set B, there is also a division
into three homogeneous subgroups (-2,-1,0 <
0,1 <2,3), but the effect is more consistent with
our hypothesis, with the percentage of longer
vowel in batte responses increasing from
stimulus condition -2 (3%) to stimulus condi-
tion 3 (94%). A sudden change in the response
is visible going from condition 1 to condition 2.
Differences between the listeners
Clear differences can be observed between the
listeners. In the responses to stimulus set A (see
Figure 4), listener PL for instance follows the
expected pattern with a steady increase in the
number of longer vowel in batte responses
from stimulus condition -2 to condition 3, and
this subject had more than 50% such responses
in condition 3. Listener SH on the other hand
does not seem to be influenced by the differ-
ences between the stimuli, with around 50%
longer vowel in batte responses for all stimu-
lus conditions.
Figure 4. Individual listener response percentages
(of "longer vowel in batte") for set A.
This shows that it was not only the stimuli
which created the differences, although on the
other hand all subjects behaved roughly the
same for stimulus set B (see Figure 5), which
was more strongly controlled in that all stimuli
were derived from a single stimulus pair.
Discussion
The two stimulus sets show very different re-
sults. Set B seems to be most reliable in the ob-
served tendencies, both across stimulus condi-
tions and across subjects. These results show
that the vowel in batte is judged longer than
that in badde when the duration of the
modally voiced +breathy voiced +half of the
preaspirated portion of the vowel together ex-
ceed the duration of the vowel in badde. In
other words, the whole breathy voiced and half
of the preaspirated portion of the vowel contrib-
ute to its perceived length. This corresponds to
previous observations concerning (post-)aspiration.
But these observations are not corroborated
by the results for set A. In this stimulus set, the
variation across stimulus conditions is much
smaller and the behaviour of individual listen-
ers shows more variation. This may indicate
that the listeners rely on different cues for
vowel duration, which may also differ across
the stimulus pairs remember that the stimuli
in the listening experiment were not all based
on the same pair as in set B. Also, there was a
strong preference for longer vowel in badde
responses in all conditions for set A. We were
not able to find a reason for this, despite close
inspection of the stimuli.
The results highlight a possible inconsis-
tency in segmentation conventions, since aspi-
ration (also used in Norwegian after fortis plo-
sives) is normally considered part of the pre-
ceding plosive which causes it, whereas
preaspiration is normally segmented as part of
the vowel instead of the following plosive. The
results for their effect on perceived vowel dura-
tion which are reported here and the unpub-
lished results for aspiration show, however, that
both affect the perceived duration of the vowel.
Using traditional segmentation criteria can
therefore lead to wrong conclusions if the seg-
mentation in phonetic studies on vowel dura-
tion is used to make inferences about the effect
of vowel duration in perception.
Of course, this study was limited in its ap-
proach. Other consonantal places of articulation
and the speakers sex, which have been shown
to differ in production studies (e.g. Helgason
and Ringen, 2008), should be taken into con-
sideration in perceptual studies.
References
Helgason, P. and Ringen, C. (2008). Voicing and aspiration in Swedish stops. Journal of Phonetics, doi: 10.1016/j.wocn.2008.02.003 (to appear).
Moxness, B. (1997). Preaspirasjon in Trønder. MA thesis, NTNU.
Peterson, G.E. and Lehiste, I. (1960). Duration of syllable nuclei in English. J. Acoust. Soc. Am. 32 (6), 693-703.
Van Dommelen, W. (1998). Production and perception of preaspiration in Norwegian. In Proc. FONETIK 98, 20-23.
Van Dommelen, W. (1999). Preaspiration in intervocalic /k/ vs. /g/ in Norwegian. Proc. ICPhS, San Francisco, 2037-2040.
Figure 5. Individual listener response percentages
(of "longer vowel in batte") for set B.
The fundamental frequency variation spectrum
Kornel Laskowski¹, Mattias Heldner² and Jens Edlund²
¹ interACT, Carnegie Mellon University, Pittsburgh PA, USA
² Centre for Speech Technology, KTH Stockholm, Sweden
Abstract
This paper describes a recently introduced
vector-valued representation of fundamental
frequency variation, the FFV spectrum, which
has a number of desirable properties. In par-
ticular, it is instantaneous, continuous, distri-
buted, and well-suited to application of stan-
dard acoustic modeling techniques. We show
what the representation looks like, and how it
can be used to model prosodic sequences.
Introduction
While speech recognition systems have long
ago transitioned from formant localization to
spectral (vector-valued) formant representa-
tions, prosodic processing continues to rely
squarely on a pitch tracker's ability to identify a
peak, corresponding to the fundamental fre-
quency (F0) of the speaker. Peak localization in
acoustic signals is particularly prone to error,
and pitch trackers (cf. de Cheveigné & Kawa-
hara, 2002) and downstream speech processing
applications (Shriberg & Stolcke, 2004) employ
dynamic programming, non-linear filtering, and
linearization to improve robustness. These me-
thods introduce long-term dependencies which
violate the temporal locality of the F0 estimate,
whose measurement error may be better han-
dled by statistical modeling than by (linear)
rule-based schemes. Even if a robust, local, ana-
lytic, statistical estimate of absolute pitch were
available, applications require a representation
of pitch variation and go to considerable addi-
tional effort to identify a speaker-dependent
quantity for normalization (e.g. Edlund &
Heldner, 2005).
In the current work, we describe a recently
derived representation of fundamental frequen-
cy variation (see also Laskowski, Edlund, &
Heldner, 2008a, 2008b; Laskowski, Wölfel,
Heldner, & Edlund, in press), which implicitly
addresses most if not all of the above issues.
This spectral representation, which we will re-
fer to here as the fundamental frequency varia-
tion (FFV) spectrum, is (1) instantaneous, not
relying on adjacent frames; (2) continuous, de-
fined for all frames; (3) distributed; and (4) po-
tentially sparse, making it suitable for the appli-
cation of standard acoustic modeling techniques
including bottom-up, continuous statistical se-
quence learning.
In previous work, we have shown that this
representation is useful for modeling prosodic
sequences for prediction of speaker change in
the context of conversational spoken dialogue
systems (Laskowski et al., 2008a, 2008b); how-
ever, the representation is potentially useful for
any prosodic sequence modeling task.
The fundamental frequency varia-
tion spectrum
Instantaneous variation in pitch is normally
computed by determining a single scalar, the
fundamental frequency, at two temporally adja-
cent instants and forming their difference. F0
represents the frequency of the first harmonic in
a spectral representation of a frame of audio,
and is undefined for signals without harmonic
structure. In the context of speech processing
applications, we view the localization of the
first harmonic and the subsequent differencing
of two adjacent estimates as a case of subop-
timal feature compression and premature infe-
rence, since the goal of such applications is not
the accurate estimate of pitch. Instead, we want
to leverage the fact that all harmonics are
equally spaced in adjacent frames, and use
every element of a spectral representation to
yield a representation of the F0 delta.
To this end, we propose a vector-valued re-
presentation of pitch variation, inspired by va-
nishing-point perspective, a technique used in
architectural drawing and grounded in projec-
tive geometry. While the standard inner product
between two vectors can be viewed as the
summation of pair-wise products with pairs se-
lected by orthonormal projection onto a point at
infinity, the proposed vanishing-point product
induces a 1-point perspective projection onto a
point at a finite distance (Figure 1). When applied to two vec-
tors representing a signal's spectral content,
F_L and F_R, at two temporally adjacent instants,
the vanishing-point product yields the standard
dot product between F_L and a dilated version
of F_R, or between F_R and a dilated version of
F_L, for positive and negative values of the
dilation parameter ρ, respectively.
Figure 1. The standard dot product shown as an orthonormal projection onto a point at infinity (left panel),
and the proposed vanishing-point product, which generalizes the former (right panel).
The degree of dilation is controlled by the
magnitude of ρ. The proposed vector-valued
representation of pitch variation is the
vanishing-point product, evaluated over a
continuum of ρ. For each analysis window,
centered at time t, we compute the short-time
frequency representation of the left-half and
the right-half portion of the window, leading
to F_L and F_R, respectively, using two
asymmetrical windows which are mirror
images of each other, as shown in Figure 2.
Figure 2. Left and right windows used for the computation
of F_L and F_R, respectively, consisting of asymmetrical
Hamming and Hann window halves. T_0 is 4 ms, and T_1 is
12 ms, for a full analysis window width of 32 ms. A 32 ms
Hamming window is shown for comparison.
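One possible construction of the two mirror-image half windows, splicing a long Hamming rise onto a short Hann fall; the sampling rate and the exact assignment of the Hamming and Hann halves are our assumptions, not taken from the paper:

```python
import numpy as np

def half_windows(fs=16000, t0=0.004, t1=0.012):
    """Rough sketch of the asymmetric analysis windows: each half window
    spans t1 + t0 = 16 ms, with a 12 ms rise and a 4 ms fall, so that the
    left and right windows peak 2*t0 = 8 ms apart within the 32 ms frame."""
    n0, n1 = int(t0 * fs), int(t1 * fs)
    rise = np.hamming(2 * n1)[:n1]   # 12 ms rising Hamming half
    fall = np.hanning(2 * n0)[n0:]   # 4 ms falling Hann half
    left = np.concatenate([rise, fall])
    right = left[::-1]               # the right window is the mirror image
    return left, right
```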
F_L and F_R are N = 512-point Fourier
transforms, computed every 8 ms. The peaks of
the two windows are 8 ms apart. The FFV
spectrum is then given by
<
=
+
+
0 ,
| ) 2 (
~
| | ] [ |
| ) 2 (
~
| | ] [ |
0 ,
| ] [ | | ) 2 (
~
|
| ] [ | | ) 2 (
~
|
2 / 4 * 2
/ 4 *
2 * 2 / 4
* / 4
r
k F k F
k F k F
r
k F k F
k F k F
r g
N r
R L
N r
R L
R
N r
L
R
N r
L
where, in each case, summation is from
k = -N / 2 +1 to k = N / 2; for convenience, r
varies over the same range as k. Normalization
ensures that g[r] is an energy-independent re-
presentation. The frequency-scaled, interpolated
values F̃_L and F̃_R are given by
F̃_L(2^(4r/N) k) = (1 - α_L) F_L[⌊2^(4r/N) k⌋] + α_L F_L[⌊2^(4r/N) k⌋ + 1],

F̃_R(2^(-4r/N) k) = (1 - α_R) F_R[⌊2^(-4r/N) k⌋] + α_R F_R[⌊2^(-4r/N) k⌋ + 1],

where

α_L = 2^(4r/N) k - ⌊2^(4r/N) k⌋,
α_R = 2^(-4r/N) k - ⌊2^(-4r/N) k⌋,

i.e. linear interpolation between the two FFT bins nearest each dilated frequency.
A sample FFV spectrum, for a voiced
frame, is shown in Figure 3; for unvoiced
frames, the peak tends to be much lower and
the tails much higher. The position of the peak,
with respect to r = 0, indicates the current rate
of fundamental frequency variation. The sample
FFV spectrum shown in Figure 3 thus indicates
a single frame with a slightly negative slope,
that is, a slightly falling pitch.
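A simplified sketch of the computation: normalized dot products between one half-window magnitude spectrum and dilated versions of the other, using a dilation factor of 2**(4r/N) and plain linear interpolation in place of the paper's exact frequency scaling. All names are ours, and only non-negative frequencies are used:

```python
import numpy as np

def dilate(F, factor):
    """Resample a magnitude spectrum at dilated bin positions k*factor,
    linearly interpolating between bins (zero outside the spectrum)."""
    k = np.arange(len(F))
    return np.interp(k * factor, k, F, left=0.0, right=0.0)

def ffv_spectrum(FL, FR, N=512, r_values=None):
    """g[r]: cosine similarity between F_L and F_R after dilating one of
    them by 2**(4*r/N); positive r dilates F_R, negative r dilates F_L."""
    if r_values is None:
        r_values = np.arange(-N // 8, N // 8)
    g = []
    for r in r_values:
        rho = 2.0 ** (4.0 * r / N)
        if r >= 0:
            a, b = FL, dilate(FR, rho)
        else:
            a, b = dilate(FL, 1.0 / rho), FR
        g.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return np.asarray(g)
```

For two identical harmonic spectra the peak of g sits at r = 0 (no pitch change); a rising or falling pitch shifts the peak to the right or left, as described above.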
Figure 4. Götaland and Svealand listeners' average dialect location errors (distance from correct location, arbitrary units) for speakers from six Elert dialect areas, from the far south via Göteborg, Stockholm, Gotland and Upper Dalarna to Norrland in the far north.
Discussion and future work
Our data suggest that Svealand listeners are
less able to locate dialects, except their own
and the accent from Dalarna, which is geo-
graphically nearby. It is probable that human
listeners are better at identifying and locating
dialects originating from their own dialectal
area than those coming from more distant re-
gions. However, the Gtaland listeners were
also good at locating Svealand speakers, pos-
sibly due to the great exposure of these dia-
lects in media. The high error values for the
far north part of Norrland may be explained
by the longer distances between towns and
different-sounding dialects in that area, but
also in part because of the subjects' lesser ex-
posure to northern accents. These are some
examples of results of the dialect location
test. Further analysis of the data is planned in
the near future, particularly using full statisti-
cal analysis. A possible extension is to use
segmentally neutralized stimuli, to focus on
the prosodic features of Swedish regional va-
rieties. We also wish to use listener clustering
as a tool in deciding which factors play the
most important roles for distinguishing the
different Swedish dialect types, which might
lead to modified dialect taxonomy.
Acknowledgements
This work is supported by a grant from the
Swedish Research Council.
References
Bruce G., Granström B. and Schötz S. (2007) Simulating Intonational Varieties of Swedish. Proceedings of ICPhS XVI, Saarbrücken, Germany.
Elenius K. (1999) Two Swedish SpeechDat databases - some experiences and results. Proceedings of Eurospeech 99, 2243-2246.
Elert C.-C. (1994) Indelning och gränser inom området för den nu talade svenskan - en aktuell dialektografi. In Kulturgränser - myt eller verklighet? (Edlund, L.E. (Ed.)). Umeå, Sweden: Diabas, 215-228.
F0 in contrastively accented words in three Finnish dialect areas
Riikka Ylitalo
Phonetics, Oulu University
Abstract
Accent in Finnish is realized mainly through an
F0 rise-fall pattern. The phonetic realization of
Finnish accent has so far been investigated
most systematically in Northern Finnish. This
study looked at accentuation also in two Western
Finnish dialects. It turned out that F0
reaches a higher level in Northern Finnish accented
words than in those of the Western Finnish
dialects, and that in the Western dialect of
Turku the timing of the F0 rise-fall pattern in
the CV.CV(X) words differs from that in all
the other words investigated.
Introduction
Contrastive accent in Finnish is realised by an
F0 rise-fall; in addition, it lengthens segment
durations in certain parts of the word (Suomi,
Toivanen & Ylitalo, in preparation; Suomi,
Toivanen & Ylitalo 2006, 239). So far the phonetic
realisation of accentuation in Finnish has
been studied most systematically in Northern
Finnish. In Northern Finnish, F0 normally rises
during the first mora of the accented word and
falls mainly during its second mora. This rise-fall
pattern is uniform across words of different
segmental structure (Suomi, Toivanen
& Ylitalo 2006, 225-227).
Method
The aim of this study is to investigate how con-
trastive accent is realised phonetically by
speakers from three different dialect areas of
Finnish, from Oulu, Turku and Tampere re-
gions. Oulu belongs to the area of Northern
Finnish and the other cities belong to the west-
ern dialect area of Finnish, more accurately
Turku belongs to the South-West dialect area
and Tampere to the Häme dialect area. In this
study there were 6 informants from each dialect
area; the informants were born, or have lived
since early childhood in the area they represent.
The informants are all women, and they were
18-25 year-old students at the time of the re-
cordings. The informants read the target words
embedded in frame sentences from a computer
screen in a studio. Their speech was recorded
(44.1 kHz, 16 bit) directly to hard disc; the
Tampere informants' speech was first recorded
on a mini-disc and later copied to hard disc. It
needs to be pointed out that the speech material
used in this study is definitely not dialect in the
proper meaning of the word, even though the
word "dialect" is used to describe the informants'
backgrounds; the speakers spoke their locally
coloured variants of Standard Finnish.
The material consisted of 10 words representing
each of the structures CV.CV, CV.CV.CV
and CV.CV.CVC.CV (for example sika, sikala,
sikalasta), similarly 10 words of each of the
structures CVV.CV, CVV.CV.CV and
CVV.CV.CVC.CV (for example siika, Siikala,
Siikalasta), 5 CVCa.CaV, 5 CVCa.CaV.CV
and 5 CVCa.CaV.CVC.CV words (for example
seppä, Seppälä, Seppälästä), 3 CVCaVoiceless.CbV,
3 CVCaVoiceless.CbV.CV and
3 CVCaVoiceless.CbV.CVC.CV words (for example
sotka, Sotkamo, Sotkamosta), and
2 CVCaVoiced.CbV, 2 CVCaVoiced.CbV.CV and
2 CVCaVoiced.CbV.CVC.CV words (for example
kanta, kantama, kantamasta). Nearly all of the
C2s of the CV.CV(X) and CVV.CV(X) words
and the Ca.Ca sequences of the CVCa.CaV(X)
words are voiceless. Altogether there were 1620 target
word tokens, 540 from each dialect area.
The target words were placed in the frame
sentences in a position in which they would be
contrastively accented. The informants were
also asked to emphasize capitalised words. For
example, the target word koti 'home' was
placed in the following frame sentence: Sanoin
että Annan KOTI paloi, en sanonut että Annan
KOULU paloi 'I said that Anna's HOME burnt,
I didn't say that Anna's SCHOOL burnt'. The
F0 values were measured with Praat at the following
points: at the beginning and at the end
of the syllable preceding the target word; at the
beginning, the middle and the end of the first
syllable of the target word, as well as at the
temporal midpoint between the beginning and
the middle, and at the temporal midpoint between
the middle and the end. That is, F0 was
measured at five equidistant points of the first
syllable, and similarly for the second syllable.
In the third syllable F0 was measured at the beginning,
at the middle and at the end, and in the
fourth syllable at the beginning and at the end.
There were more measurement points in the
first and the second syllables of the words than
in later syllables because the accentual F0 curve
is mostly realised during the first two syllables,
and also because especially the fourth syllable
is often reduced. F0 was also measured at the
beginning, at the middle and at the end of the
syllable following the target word, and at the
highest peak in the target word. The temporal
location of the highest F0 peak from the
word onset was also measured. All the target words
produced by the informants were listened to before
making the analyses, and mispronounced
target words, as well as target words that were
produced unaccented, were rejected.
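The five equidistant points per syllable reduce to a one-line computation; a sketch (the function name is ours):

```python
def measurement_points(start, end, n=5):
    """n equidistant time points from syllable start to end, inclusive:
    beginning, quarter points, middle and end for n = 5."""
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]
```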
Results
Every measured F0 value was normalised before
performing statistical analyses by subtracting
from it the F0 value at the beginning of the syllable
preceding the same target word, and
then adding the mean of the F0 values at the
beginning of the syllables preceding all the target
words. In the CV.CV(X) words the F0 of
the first syllable was at all five measurement
points higher in Oulu than in the other dialects,
which did not differ from each other in their F0
values (point 1 [F(2,45) = 11.04, p < 0.001],
point 2 [F(2,45) = 10.59, p < 0.001], point 3
[F(2,45) = 9.80, p < 0.001], point 4 [F(2,45) =
8.16, p = 0.001], point 5 [F(2,45) = 7.22, p <
0.01]). At the first three measurement points of
the second syllable there were no significant F0
differences between dialects, but at the last two
measurement points of the second syllable F0
was higher in Turku than in Oulu, and Tampere's
F0 values did not differ from those of the
other dialects (point 4 [F(2,45) = 3.23, p <
0.05], point 5 [F(2,45) = 4.34, p < 0.05]). At the
first and the second measurement points of the
third syllable there were no significant F0 differences
between dialects. At the third measurement
point of the third syllable F0 was
higher in Turku than in Oulu, and Tampere
didn't differ from the other dialects [F(2,30) =
4.27, p < 0.05]. At the fourth syllable the F0
values were statistically similar in all the dialects.
The F0 peak was located approximately
172 ms from word onset in Oulu, 264 ms in
Turku and 207 ms in Tampere. The location of
the peak was statistically the same in Oulu and
in Tampere, but in Turku the peak was significantly
later in the word than in the other dialects
[F(2,45) = 10.09, p < 0.001].
Figure 1. F0 in the CV.CV(X) words (the segment duration values represent the durations in the four-syllabic words). The measurement points between which additional comparisons were made (see below) are marked with rectangles.
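The normalisation procedure described at the start of this section amounts to a per-word shift of the F0 values; a minimal sketch with made-up Hz values (not the authors' code):

```python
def normalise_f0(values, preceding_onset_f0, mean_preceding_onset_f0):
    """Normalise the F0 values of one target word: subtract the F0 at the
    onset of the syllable preceding that word, then add the mean of the
    corresponding onset values over all target words."""
    return [v - preceding_onset_f0 + mean_preceding_onset_f0 for v in values]

# Hypothetical data: raw F0 (Hz) at the measurement points of one word,
# the preceding-syllable onset F0 for that word, and the grand mean of
# preceding-syllable onset F0 over all target words.
raw = [180.0, 195.0, 210.0, 200.0, 185.0]
print(normalise_f0(raw, preceding_onset_f0=175.0, mean_preceding_onset_f0=170.0))
# Each value is shifted by -175 + 170 = -5 Hz.
```

The shift removes between-utterance register differences while keeping the grand mean of the material.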
Because the segments have different durations in the different dialects, the syllable-bound measurement points lie at different absolute distances from the word onset in each dialect. It can be seen in Figure 1, at roughly 290 ms, that in the four-syllabic CV.CV(X) words the first second-syllable Turku measurement point is closer to the second than to the first second-syllable Oulu measurement point. There was no significant F0 difference between Oulu and Turku at the respective first second-syllable measurement points, but when the temporally close measurement points were considered (those inside the leftmost rectangle in Figure 1), there was a significant difference: F0 was higher in Turku than in the other dialects, in which the F0 level was statistically the same [F(2,51) = 4.10, p < 0.05]. A similar comparison was made between the Oulu and Tampere second-syllable third measurement points and the Turku second-syllable second measurement point (the measurement points within the middle rectangle), and it turned out that F0 was higher in Turku than in the other dialects, in which the F0 values were statistically the same [F(2,51) = 6.56, p < 0.01]. A comparison between the Oulu and Turku second-syllable fourth measurement points and the Tampere second-syllable fifth measurement point (the rightmost rectangle) revealed that F0 was still higher in Turku than in Oulu, but Tampere did not differ statistically from the other dialects [F(2,51) = 3.88, p < 0.05]. Similar F0 comparisons between nominally different but temporally close measurement points were also made in the other word structures in situations where there were no significant F0 differences between the dialects at the nominally same measurement points, but nominally different measurement points lay temporally closer to each other than the nominally same ones. However, none of those comparisons revealed any significant dialect differences.
In the CVV.CV(X) words (see Figure 2), F0 was higher in Oulu than in the other dialects (which did not differ from each other) at the first [F(2,45) = 9.62, p < 0.001], second [F(2,45) = 10.37, p < 0.001] and third [F(2,45) = 6.88, p < 0.01] measurement points of the first syllable. At the later measurement points there were no significant F0 differences between the dialects, except at the third-syllable third measurement point, where F0 was lower in Oulu than in Turku, and the Tampere F0 values did not differ from those of the other dialects [F(2,30) = 4.09, p < 0.05]. The location of the F0 peak was statistically similar in all the dialects.
Figure 2. F0 in the CVV.CV(X) words (the duration values represent the durations in the four-syllabic words).
In the CVCa(voiceless).CbV(X) words (see Figure 3), F0 was higher in Oulu than in the other dialects (which did not differ statistically from each other) at the first-syllable first measurement point [F(2,45) = 9.48, p < 0.001] and at the first-syllable second measurement point [F(2,45) = 9.30, p < 0.001]. At the third [F(2,45) = 8.56, p = 0.001], fourth [F(2,45) = 7.03, p < 0.01] and fifth [F(2,45) = 5.60, p < 0.01] measurement points of the first syllable, F0 was higher in Oulu than in Tampere, but Turku did not differ from the other dialects. Later in the word there were no significant F0 differences between the dialects. The dialects did not differ from each other in the location of the F0 peak.
Figure 3. F0 in the CVCa(voiceless).CbV(X) words (the duration values represent the durations in the four-syllabic words).
In the CVCa(voiced).C2V(X) words (Figure 4), F0 was higher in Oulu than in the other dialects (which did not differ from each other statistically) at the first four measurement points of the first syllable (point 1 [F(2,45) = 15.23, p < 0.001], point 2 [F(2,45) = 15.34, p < 0.001], point 3 [F(2,45) = 15.47, p < 0.001], point 4 [F(2,45) = 7.57, p < 0.001]). At the later measurement points of the CVCa(voiced).C2V(X) words there were no significant F0 differences between the dialects. The location of the F0 peak was also statistically the same in all the dialects.
Figure 4. F0 in the CVCa(voiced).C2V(X) words (the duration values represent the durations in the four-syllabic words).
In the CVCa.CaV(X) words (Figure 5), F0 was higher in Oulu than in the other dialects (which did not differ from each other statistically) at the first-syllable first [F(2,45) = 12.81, p < 0.001] and second [F(2,45) = 12.70, p < 0.001] measurement points. At the third [F(2,45) = 9.41, p < 0.001], fourth [F(2,45) = 7.96, p = 0.001] and fifth [F(2,45) = 6.92, p < 0.01] measurement points of the first syllable, F0 was higher in Oulu than in Tampere, but Turku did not differ from the other dialects. Later in the CVCa.CaV(X) words there were no significant F0 differences between the dialects, and the location of the F0 peak was statistically the same in all the dialects.
Figure 5. F0 in the CVCa.CaV(X) words (the duration values represent the durations in the four-syllabic words).
An additional experiment
It turned out above that the Turku CV.CV(X) words differ from all the other words investigated: in these words, F0 is at its highest at the second-syllable first measurement point, not during the first syllable. Because nearly all of the C2s of the CV.CV(X) words in the original material are voiceless, it remained unclear where the F0 peak would be if C2 were voiced. To resolve this question, some extra material was recorded: 5 informants, who were among the Turku informants in the first recordings, read 30 CV.CV words in which both consonants were voiced. The words were placed in frame sentences in positions where they would carry contrastive accent; for example, the target word lumi 'snow' was placed in the sentence Sanoin että kaikki LUMI katosi, en sanonut että kaikki LOSKA katosi 'I said that all SNOW disappeared, I didn't say that all SLUSH disappeared'. The recordings were performed technically in the same way as the first recordings. F0 was measured at five equidistant points of the first syllable in the same manner as in the first study; in the second syllable F0 was measured at seven points: in addition to the five measurement points used in the first study, F0 was also measured at the temporal midpoint between the first and the second measurement points and at the temporal midpoint between the second and the third measurement points. F0 in the syllables preceding and following the target word was measured in the same way as in the first study. The temporal location and the Hz value of the F0 peak within the word were also measured.
Figure 6. F0 and segment durations in the new Turku CVCV words (with voiced consonants). The mark representing the F0 peak is indicated by an arrow.
Figure 6 shows the main result of the additional experiment: in the Turku dialect CV.C2V words, F0 reaches its peak value a little before the middle of C2, if C2 is voiced; in the extra-material target words the F0 peak is located approximately 215 ms from word onset.
Discussion
In all the word structures investigated, F0 was higher in Oulu than in the other dialects in the first syllable, or at least at the beginning of the first syllable. Otherwise, the F0 curves were quite similar across the dialects, with one exception: in the CV.CV(X) words of the Turku dialect, the F0 peak occurred a little before the middle of C2 (as observable in words in which this consonant is voiced). In all the other word structures investigated the F0 peak occurred at the end of the first syllable. In the second syllable of the CV.CV(X) words, F0 is also higher in Turku than in the other dialects. (C)V.CV(X) is the only word structure in Finnish in which the word's second mora is in the second syllable of the word. Evidently the Turku dialect manages this situation in a way that differs from that used in the two other dialects investigated. The segment durations in the Oulu, Turku and Tampere dialects in the word structures discussed in this paper have also been investigated; the results of those investigations will be reported in the future.
References
Suomi, K., Toivanen, J. & Ylitalo, R. (2006). Fonetiikan ja suomen äänneopin perusteet. Helsinki: Gaudeamus.
Suomi, K., Toivanen, J. & Ylitalo, R. (in preparation). Finnish sound structure.
Improving speaker skill in a resynthesis experiment
Eva Strangert¹ and Joakim Gustafson²
¹Department of Language Studies, University of Umeå
²CSC, Department of Speech, Music and Hearing, KTH
Abstract
A synthesis experiment was conducted based on
data from ratings of speaker skill and acoustic
measurements in samples of political speech.
Features assumed to be important for being a
good speaker were manipulated in the sample
of the lowest rated speaker. Increased F0 dy-
namics gave the greatest positive effects, but elimination of disfluencies and hesitation pauses, and increased speech rate, also played a role in the impression of improved speaker skill.
Introduction
The current study concerns public speaking
with a focus on qualities that characterize speakers held to be "good speakers". By that
speech to the maximum of what is possible in
order to get across to an audience. Although
public speakers vary in the extent to which they
meet this criterion of speaker skill, such a capability is a great asset, not least in politics.
It is assumed that this capability to a great ex-
tent depends on prosody, in particular how
prosody is used to signal intentions of the
speaker and attitudes towards the listener.
In a resynthesis experiment we modify the
original sample of one of the speakers analyzed
in a previous study (Strangert, 2007) concerned
primarily with subjective ratings of speaker
qualities. The speakers were chosen so as to
represent a great variation in order to be able to
single out features with a potential for distin-
guishing good speakers from less good ones.
The study also included some restricted acous-
tic-prosodic measurements of the speech sam-
ples. In closely related studies based on English
and English and Arabic, respectively, Rosen-
berg and Hirschberg (2005) and Biadsy et al.
(2008) had subjects rate charisma and corre-
lated these ratings with a great number of
acoustic features.
In the present study we build on the analy-
ses in Strangert (2007) and, in addition, extend
the acoustic-prosodic analysis presented there
to include more features potentially relevant for
the impression of speaker skill.
Rating data
The speech sample used in the resynthesis ex-
periment was one of those (16 in total) ana-
lyzed in Strangert (2007). They were re-
cordings (audio and video) from debates in the
Swedish parliament (Riksdagen) representing a
variety of speakers (more and less skilled, male
and female) and styles (read and spontaneous).
The samples were rated by 18 native Swedish subjects who gave their opinion on a number of qualities on a five-point scale from "no, absolutely not" (0) to "yes, absolutely" (4). The ratings included an overall good-speaker rating, "good speaker" being defined as a person capable of catching the attention and interest of an audience through her/his way of communicating. This rating was matched to all the other ratings and found to have strong positive correlations with expressive, powerful, involved and trustworthy (all with r ≥ .89; correlations based on means of all individual ratings for each quality). There were also positive correlations with aggressive, accusatory and agitating (r ≥ .65), and negative correlations with humble (r = -.55) and with insecure, hesitant and monotonous (r ≤ -.86).
The rating scores varied considerably be-
tween the speakers. The mean good-speaker
score varied from 3.39 for the speaker rated
highest to 0.56 for the lowest rated speaker.
Acoustic analysis
The acoustic measures included F0 range (in
semitones, and as a ratio of mean F0 maximum
of focused words to mean F0) and number of
focus positions. Further, minimum, maximum
and mean F0 and mean of F0 maximum of fo-
cused words were measured separately for the
male and female speakers. (Focused words
were identified by the first author through lis-
tening.) Measurements also included a number
of duration features and speech rate measures,
which we leave out here, concentrating on those features correlating strongly with "good speaker".
Table 1 summarizes F0 measures and their
correlations with the (mean) good-speaker
scores. As shown, the number of focused words
varies considerably. There is a positive correlation of .47 with the good-speaker scores, but it does not reach significance (p = .07).

Table 1. F0 measures and their correlations with the mean good-speaker scores.

Feature                                   max    min    r     p
Mean F0 max of focused words / mean F0    1.85   1.17   .61   .01*
F0 range, 75-25 percentiles (ST)          8.78   2.44   .61   .01*
Mean F0 max of focused words, female      384    219    .87   .005**
Mean F0 max of focused words, male        233    150    .76   .03*
F0 max, female                            516    265    .82   .01*
F0 max, male                              325    190    .63   .09
F0 mean, female                           246    180    .62   .10
F0 mean, male                             164    113    .65   .08
Focused words (N)                         23     3      .47   .07
Significant and positive correlations include F0
range, measured in semitones and as a ratio of
mean F0 maximum of focused words to mean
F0. The two measures give similar results
(r=.61; p<.05) and we therefore confine our-
selves to the semitone data in the following.
Ranges in semitones between the 25% and 75%
points in the F0 distribution vary between 8.78
and 2.44 for individual speakers with a median
of 4.7. These figures may be compared to similarly computed ranges (25%-75% distribution points) extracted from 498 speakers in the Swedish SpeeCon database (Carlson et al., 2004). The great majority of these ordinary speakers had a range of 2-5 semitones. Half of our speakers' ranges thus fell outside this more restricted interval.
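The interquartile F0 range in semitones can be computed directly from an F0 distribution using the standard conversion 12·log2(f1/f2); a sketch (not the authors' code, with made-up F0 samples):

```python
import math

def semitone_interval(f_high, f_low):
    """Distance in semitones between two frequencies (Hz)."""
    return 12.0 * math.log2(f_high / f_low)

def iqr_range_st(f0_values):
    """F0 range in semitones between the 25% and 75% points of the
    distribution, as in the measure reported above."""
    xs = sorted(f0_values)

    def percentile(p):
        # simple linear-interpolation percentile; enough for a sketch
        k = (len(xs) - 1) * p
        lo, hi = int(math.floor(k)), int(math.ceil(k))
        return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

    return semitone_interval(percentile(0.75), percentile(0.25))

# Hypothetical F0 samples (Hz) from one speaker:
f0 = [110, 115, 120, 125, 130, 140, 150, 165]
print(round(iqr_range_st(f0), 2))
```

Working in semitones makes ranges comparable across male and female speakers, which is why the range measure is reported in ST rather than Hz.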
There is a significant correlation of "good speaker" with F0 maximum, but only for the female speakers (r = .82; p < .05). For the mean of
F0 maximum of focused words, on the other
hand, there is a positive correlation for both the
male (r=.76; p<.05) and female (r=.87; p<.01)
speakers. This feature is comparable to the
pitch range measure (mean HiF0, the highest
accented pitch peak) in Biadsy et al. (2008)
which correlated positively with charisma in
American English and Palestinian Arabic, when rated by Americans and Palestinians as well as by Swedish subjects. In addition, this feature was found to be more important for the Swedish subjects than for the others.
Thus F0 dynamics appears to be primarily
associated with the extent to which the range is
widened upwards; the correlation with F0 min-
imum is insignificant. Nor is mean F0 over the speech sample (excluding silent pauses) significant, although both the female and the male speakers' correlations exceed .6 (with p = .08 and .10, respectively). Biadsy et al. (2008), on
the other hand, found significant correlations
between mean F0 and charisma ratings, irre-
spective of the language of the raters, for both
American and Palestinian Arabic speech mate-
rials.
Fluency and speaking style
As some speakers were perceived as far more
fluent than others, a measure of disfluency was
calculated. The number of positions with a slip
of the tongue, a repetition or a repair was de-
termined through listening by the first author.
This measure showed a strong negative correlation (r = -.72; p < .01) with "good speaker". A similar negative correlation was also found in the cross-cultural comparison by Biadsy et al. (2008), with the exception of the Swedish judgments of American English. We note this difference between Swedish judgments of Swedish and of English, respectively, which may be ascribed to cultural influences.
Disfluencies occur primarily in speech pro-
duced spontaneously as a result of problems
with the planning of what to say next. As some
of our speakers read from a manuscript and
some spoke more freely, we could relate the
disfluency scores to the read vs. spontaneous
style of speaking. Even though the three most
disfluent speakers were in fact speaking spon-
taneously, there was no obvious relation taking
all speakers into account. Neither was there any
obvious relation between speaking style and the
good-speaker rating.
Resynthesis experiment
In the acoustic analysis, F0 features, in particular a wide F0 range and high-peaked focused words, were found to accompany high ratings of "good speaker", while the opposite, a smaller range and focused words with lower peaks, accompanied low ratings. Also, the good speakers were to a great extent fluent, while the less good ones produced many repetitions, repairs, etc. These results were followed up in an experiment in which the sample of the speaker with the lowest score (0.56) for "good speaker" was modified in several ways. The assumption was that, relying on our production results, we could improve the perceived skill of speaking.
The scores for the selected (male) speaker
were high for insecure, hesitant, monotonous
and low for expressive, powerful, aggressive
and trustworthy, and he had the highest score of all speakers for humble. He was also the second
most disfluent speaker with a total of 12 disflu-
ency positions, and he had the smallest F0
range (2.84 ST), F0 maximum (190 Hz), and
mean F0 maximum of focused words (150 Hz).
This speaker also was the slowest with a speech
rate (including pauses) of 3.46 syllables per
second. Thus, this speaker is a natural candi-
date for the resynthesis experiment.
Hypotheses
The features to be evaluated included the two
that had the highest correlations (positive and negative, respectively) with "good speaker": F0 dynamics and disfluencies. (In the following, when referring to experimental manipulations, we use the more neutral term "fluency" instead of "disfluencies".) As the selected speaker was
extremely slow, we also included speech rate,
although the speech rate features overall correlated insignificantly with "good speaker".
We hypothesized that of these features, rate
would be the least effective for improvement of
speaker skill. Concerning the other two, we
could not decide in advance between alterna-
tive a) and b). There might also be interactions
between the features, giving a third alternative:
a) F0 dynamics > fluency > speech rate
b) fluency > F0 dynamics > speech rate
c) F0 dynamics, fluency and speech rate interact
Experimental setup
To create the experimental stimuli, we used the
KTH resynthesis toolkit EXPROS (Gustafson and
Edlund, in press) together with the Mbrola di-
phone synthesis toolkit (Dutoit et al., 1996).
This was a three-step process: first, the EXPROS
toolkit was used to automatically generate the
data needed to build a new Mbrola diphone da-
tabase from the original speech sample (36 sec-
onds in length). Then the Mbrola toolkit was
used to build a customized Mbrola mini-voice.
Finally, EXPROS was used to modify the pro-
sodic features of the original speech sample.
The following manipulations were performed:
F0 dynamics: The pitch scale was transformed to a semitone
scale. The mean pitch was increased by two semitones and the
range was expanded, so that the standard deviation increased
from 2 semitones to 4.
Fluency: Reduction of disfluencies was made by cutting out
slips of the tongue and repetitions.
Speech rate: Speech rate was increased by 5% and long silent
hesitation pauses were considerably shortened.
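The F0 manipulation above can be sketched as a transform on a sampled pitch contour: map to a semitone scale, shift the mean up two semitones, scale the deviations so the standard deviation doubles, and map back to Hz (an illustration of the operation described, not the EXPROS implementation; the reference frequency and contour values are hypothetical):

```python
import math

def transform_f0(contour_hz, mean_shift_st=2.0, sd_scale=2.0, ref_hz=100.0):
    """Shift the mean of an F0 contour up by mean_shift_st semitones and
    scale its deviations by sd_scale, working on a semitone scale."""
    st = [12.0 * math.log2(f / ref_hz) for f in contour_hz]
    mean = sum(st) / len(st)
    new_st = [mean + mean_shift_st + sd_scale * (s - mean) for s in st]
    return [ref_hz * 2 ** (s / 12.0) for s in new_st]

# Hypothetical, rather flat contour (Hz):
contour = [100.0, 105.0, 110.0, 105.0, 100.0]
print([round(f, 1) for f in transform_f0(contour)])
```

Scaling deviations on the semitone scale rather than in Hz keeps the expansion perceptually uniform across the speaker's range.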
Thus, there were eight stimuli (2x2x2) includ-
ing all combinations of original and modified
F0 dynamics (O/M F0), fluency (O/M fluency)
and speech rate (O/M rate).
The 12 subjects in the test, all academic
staff or students in areas other than phonetics,
made a ranking of the eight versions. They did
so using an interactive computer program im-
plementing a visual sort and rate/rank method.
Each of the eight versions was represented
by an icon in random order on the computer
screen. The subjects were instructed to rank them from best (1) to worst (8) with reference to the criterion for a good speaker, that is, "a person capable of catching the attention of an audience through her/his way of speaking". Be-
fore coming up with the ordering they pre-
ferred, the subjects could listen to the stimuli as
many times as they wished and try different
rankings by moving the icons around.
Results
In a task as complex as this, we cannot expect total uniformity between subjects' rankings. Despite this, there is a fair degree of consistency; the correlation between subjects (Kendall's W) is .48 (p < .001).
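For reference, Kendall's coefficient of concordance W for m raters each ranking the same n items (no ties) is 12S / (m²(n³ − n)), where S is the sum of squared deviations of the per-item rank sums from their mean; a small self-contained sketch with hypothetical rankings:

```python
def kendalls_w(rankings):
    """rankings: list of m lists, each a permutation of the ranks 1..n.
    Returns W in [0, 1]; 1 means perfect agreement between raters."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Two hypothetical raters in perfect agreement over four items:
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4]]))  # -> 1.0
```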
Table 2. Medians (Md) of 12 subjects' rankings (1 = best, 8 = worst) of eight versions of a speech sample with modifications (M) of F0, fluency and speech rate relative to the original (O) version.

Stimuli                  Md
M:F0  M:flu  M:rate      1
M:F0  M:flu  O:rate      3
M:F0  O:flu  M:rate      3
M:F0  O:flu  O:rate      4.5
O:F0  M:flu  M:rate      4.5
O:F0  M:flu  O:rate      6
O:F0  O:flu  M:rate      6
O:F0  O:flu  O:rate      7.5
The results support the general assumption that
perceived speaker skill can be improved by
modifications such as those suggested by our
production data. The general trend is that the
more modifications, the higher the ranking.
This is demonstrated in Table 2, which shows the median rankings of the eight stimulus versions.
The results indicate that F0 modifications play
the major role, with modifications of fluency
and speech rate being second and third. The re-
sults pooled across all subjects then come close
to an ordering according to hypothesis a) in
terms of perceptual weight. Even though not all rankings correspond exactly to this ordering, they are nevertheless concentrated around it.
A more detailed analysis reveals interesting
differences between subjects. Some had ratings
reflecting a completely systematic ordering of
the features in accordance with hypothesis a),
while others were less systematic. This is not
unexpected, as making judgments about a phenomenon such as speaker skill is by no means a simple task. The features under investiga-
tion may be expected to interact in complex
ways, but individual experiences and prefer-
ences may also play a role. Several of the sub-
jects after the test spontaneously commented on
their impressions of the stimuli. Some of them,
for example, found slips and other disfluencies
to be very disturbing, while others looked upon
the same phenomena as something natural and
more or less unavoidable. Still, most of them, in line with the general result, favored a modified F0 range, and some reported that they could easily divide the eight versions into two groups (original vs. modified F0 dynamics), but that ranking within these groups was difficult.
Conclusions and future work
In this study, the focus was on features contributing to the impression of a "good speaker": a person capable of catching the attention and interest of an audience through her/his way of
communicating. We conducted a resynthesis
experiment based on the results from good-
speaker ratings combined with acoustic meas-
urements from a number of speakers. Acoustic
features which correlated significantly with the
subjective ratings were perceptually evaluated.
Our data had revealed a strong positive correla-
tion between good speaker and F0 peak height
of focused words and F0 range. A strong but
negative correlation was found for disfluency.
In the resynthesis evaluation these features,
combined with speech rate, were manipulated
through modifications in the speech sample of
the speaker rated lowest. By increasing F0 dy-
namics, eliminating disfluencies and hesitation
pauses, and speeding up the speech, the impres-
sion of speaker skill improved considerably.
Modifying F0 dynamics produced the greatest
effects with changes of (dis)fluency and speech
rate, respectively, second and third. The results
support related findings pointing to the impor-
tance of F0 variability in ratings of charisma (Biadsy et al., 2008) and liveliness (Traunmüller and Eriksson, 1995; Hincks, 2005).
Combined with more acoustic data, resyn-
thesis evaluations like the one conducted here
could shed further light on what makes a speak-
er a good speaker. Also, the results from the synthesis experiment open up useful applications, for example in speaker training.
Acknowledgements
We thank John Lindberg and Roberto Bresin,
KTH, for making the evaluation software avail-
able for the perceptual ranking. We also thank
all subjects for their participation in the ex-
periments. The work has been supported by
funding from the Swedish Research Council.
References
Biadsy, F., Rosenberg, A., Carlson, R., Hirsch-
berg, J. and Strangert, E. (2008) A Cross-
Cultural Comparison of American, Palestin-
ian, and Swedish Perception of Charismatic
Speech. To appear in Proc. Speech Prosody,
Campinas, Brazil.
Carlson, R., Elenius, K. and Swerts, M. (2004)
Perceptual Judgments of Pitch Range. Proc.
Speech Prosody, Nara, Japan, 689-692.
Dutoit, T., Bataille, F., Pagel, V., Pierret, N.,
Van der Vreken, O. (1996) The MBROLA
Project: Towards a Set of High-Quality
Speech Synthesizers Free of Use for Non-
Commercial Purposes. Proc. Interspeech.
Philadelphia, USA.
Gustafson, J. and Edlund, J. (In press) expros: a
toolkit for exploratory experimentation with
prosody in customized diphone voices. To
appear in Proc. 4th IEEE Workshop on Per-
ception and Interactive Technologies for
Speech-Based Systems. Kloster Irsee, Ger-
many.
Hincks, R. (2005) Computer Support for Learn-
ers of Spoken English, Diss. Speech and
Music Communication, KTH, Sweden.
Rosenberg, A. and Hirschberg, J. (2005) Acoustic/prosodic and lexical correlates of charismatic speech. Proc. Interspeech, Lisbon, Portugal, 513-516.
Strangert, E. (2007) What makes a good spea-
ker? Subjective ratings and acoustic meas-
urements. Proc. Fonetik 2007, TMH-QPSR,
50, 29-32.
Traunmüller, H. and Eriksson, A. (1995) The perceptual evaluation of F0 excursions in speech as evidenced in liveliness estimations. JASA, 97 (3), 1905-1915.
Second-language speaker interpretations of intonational semantics in English
Juhani Toivanen
Academy of Finland & University of Oulu
Abstract
Research is reported on the way in which Fin-
nish speakers of English interpret the seman-
tic/pragmatic meaning of the fall-rise intona-
tion in spoken English. A set of constructed
mini-dialogues were used for listening tests in
which the test subjects were to interpret the
meaning of the fall-rise tone. To obtain base-
line data, a group of native speakers of English
listened to the same material, with the same in-
terpretative task. The results indicate that the
native speakers consistently interpreted the fall-rise pattern as conveying "reservation" (or irony), whereas the non-native speakers perceived a "reserved" meaning only if the lexical context explicitly supported such an interpretation.
Introduction
The semantic/pragmatic meaning of the fall-rise
intonation contour has attracted a great deal of
attention in the literature on English prosody
(the fall-rise is transcribed with the diacritic ∨ below). Basically, the tone is associated with reservations, implications and doubts. It can also be argued that the fall-rise conveys "uncertainty" or "incompletion" (as all rising tones do), but the fall-rise is apparently associated with especially delimiting "open" meanings; it has sometimes, and quite rightly, been referred to as the "contingency" tone in English intonation. That is, the fall-rise is often an indication that the proposition or argument is correct only under certain circumstances. Roach (1991) uses the terms "limited agreement" and "response with reservations" to describe the pragmatic meaning of the fall-rise. In the literature on the subject, the following meanings, for example, have been attributed to the tone: "implicatoriness", "reservation and contradiction", "lack of complete commitment", and "strong implication". The common denominator is, clearly, an indication of some concealed doubt or contrast:
the speaker may say one thing and mean some-
thing else. That is, a subtle prosody-dependent
pragmatic meaning is created.
From the viewpoint of second language ac-
quisition, intonation can be seen as belonging
to the pragmatic aspects of language. Pragmat-
ics is probably one of the most difficult areas of
second language acquisition in general. It seems
likely that misunderstandings resulting from
different ways of interpreting intonational
meaning will interfere with a common discourse space between the native speaker and the non-native interlocutor, even if the non-native speech is otherwise (e.g. grammatically) quite acceptable.
In this light, the study of the cross-linguistic
interpretation of the semantic meaning of Eng-
lish intonation contours is a most profitable un-
dertaking. For the purpose of this study, the
meaning of the fall-rise pattern was chosen for
scrutiny. On the one hand, this contour has a
specific meaning in (British) English intona-
tion; on the other hand, the fall-rise does not
have a counterpart in Finnish intonation. As Iivonen (1998) points out, a final rise is rare in Finnish, and the rules associated with the use of a final rise in French, English and German do not exist in Finnish.
Experiment
To obtain suitable test material, a native speak-
er of English, a professional phonetician, was
asked to produce a declarative utterance with a
falling-rising nuclear tone on the last word. The
test utterance, by itself and combined with other
utterances, constituted the material used in the
listening test. All the speech material used in
the experimental setup was tape-recorded with
a high-quality microphone and a DAT recorder,
and transferred onto hard disk (44.1 kHz, 16
bit). The test utterance was the following (with the nuclear tone on the latter syllable of "degree"):

She's got a good ∨degree
The speaker also produced a number of other
utterances to create coherent lines in the mini-
dialogues: the speaker is referred to as Bill.
Another native male speaker of English
(John) was the interlocutor and produced the
other lines in the dialogues (see the Notes sec-
tion). Four mini-dialogues contained the test
utterance; in three mini-dialogues the test utter-
ance was accompanied by one or two additional
utterances to produce Bill's line. Four additional mini-dialogues served as distractors: they
did not contain the test utterance. Only the test
utterance contained a falling-rising intonation:
all the other utterances in the test dialogues
ended on simple falls. The distractor dialogues
contained both falls and rises but not fall-rises.
The listeners were all university students:
the Britons majored in non-linguistic subjects,
while the Finns were first-year university stu-
dents of English. Ten Britons and ten Finns, all
female speakers in their early twenties, partici-
pated in the listening test. The test was adminis-
tered in a language laboratory; the listeners had
written transcripts of the dialogues in front of
them, and the line to which they were to pay
attention was underlined (see the Notes sec-
tion). The test subjects listened to each dialogue
once and chose one of six descriptive labels for
the line whose attitudinal/emotional content
they were to judge. The labels describing the lines were the following: "friendly", "reserved", "bored", "joyful", "casual", and "ironical".
One or two things should be pointed out at
this stage. Firstly, dialogues 1 and 3 are basical-
ly comparable to the examples given in the in-
tonation literature: the most typical semantic
meaning of an intonation contour is often dis-
cussed out of context or in a lexical context
which clearly supports the supposed meaning of
the contour. In dialogue 3, the reservation
conveyed by the fall-rise is very much in
agreement with the lexical content of the line.
The examples given by Cruttenden (1997), for
example, are rather similar to the test utterance
in dialogues 1 and 3:
You won't ˇlike it
Be careful you don't ˇfall
I like ˇJohn (but)
In dialogues 5 and 7, by contrast, the implication or doubt expressed by the fall-rise conflicts with the positive ideas expressed verbally. The interesting question is, of course, whether the pragmatic force of the fall-rise is strong enough to counteract the lexical meaning of the lines, a point rarely discussed in the literature.
All the other utterances in the test dialogues
ended on a tone which could be described as a
high-fall (i.e. a relatively wide unidirectional
f0 movement). This tone is often assumed to
represent the most typical intonation contour
with declaratives (Cruttenden 1997). The high-
fall is common even with polar questions, at least in informal conversation. Thus it can
be claimed that the test dialogues were intona-
tionally neutral apart from the utterance with
the fall-rise, i.e. the fall-rise is clearly a devia-
tion from the general falling trend and should
thus attract some special attention. However,
since the distractor dialogues contained rising
tones, the test sentence did not stand out as the
only utterance ending on a rising contour.
Results
In dialogue 1, nine native speakers of English chose the attitudinal label "reserved", and one chose the term "ironical". It seems clear, then, that, to the native ear, even a largely decontextualized utterance with the fall-rise sounds predominantly negative. The responses of the Finnish informants, by contrast, were much more heterogeneous. The label "friendly" had the most votes (4); the other interpretations were "casual" (3), "reserved" (2) and "joyful" (1).
The Finns apparently paid attention mainly to
the lexical content of the utterance. On the oth-
er hand, it might be the case that the Finnish
informants associated the falling-rising intona-
tion with friendliness. After all, (low) rising
intonation often accompanies conventionally
polite declarative utterances in spoken English.
The small amount of data, of course, prevents one from drawing any far-reaching conclusions.
An interesting question is whether the test
utterance, produced with a simple falling into-
nation, might still convey implications or
reservations in dialogue 1. That is, could the
utterance ("She's got a good degree"), as a response to the question ("What do you think of her?"), convey a conversational implicature of
some kind? It might be possible that the speak-
er deliberately flouts the maxims of quantity
and relevance in saying far too little. The situa-
tion might resemble the famous (and extremely
concise) critique of a book:
The book is well-bound and free of typographic errors.
The review flouts the maxim of quantity, and
the implicature is, clearly, that the book is terri-
ble. In dialogue 1, even without the fall-rise, the
implicature might be something like "the lady is well-educated but is a difficult person".
However, it must be emphasized that this is
basically only speculation.
Brown and Yule (1983) describe the di-
lemma of the hearer and discourse analyst as
follows: since the analyst has only limited
access to what a speaker intended, or how sin-
cerely he was behaving, in the production of a
discourse segment, any claims regarding the
implicatures identified will have the status of
interpretations. This latitude of interpretation
would probably obtain in dialogue 1 if the test
utterance were spoken with a falling tone. Natu-
rally, the semantic interpretation of the utter-
ance produced with vs. without a fall-rise
should be investigated in a separate study.
Dialogue 3 is very different from dialogue
1: here the reservations and doubts are ex-
pressed both verbally and prosodically. Here, as in dialogue 1, nine native speakers heard reservations, while one interpreted the speaker as "ironical". Eight Finns chose the label "reserved", one chose "bored" and one "ironical".
The situation seems rather straightforward: as
the lexical content is in harmony with the into-
nation contour, the reserved meaning was
readily perceivable. However, it is likely that
the Finns again regarded the lexical content as
the major factor contributing to the attitudin-
al/pragmatic meaning of the line. That is, even
without a falling-rising intonation, the attitude
might have been obvious (the same probably
goes for the British test subjects). In any case,
in this dialogue, the lexical meaning prejudices
the listener much more than in dialogue 1.
Dialogues 5 and 7 can be discussed togeth-
er. In both of them, the lexical meaning of the
test line is apparently very positive while the
tone is, again, the fall-rise conveying possible
doubts or implications: the written version of
the line could easily be interpreted as friendly or even joyful. Indeed, this predisposition was clear in the responses of the Finnish test subjects: in dialogue 5, nine listeners interpreted the speaker as "friendly" (one chose the label "joyful"), and in dialogue 7, eight informants chose "friendly", one "joyful" and one "casual".
The British informants' reactions differ markedly from those of the Finnish listeners. In dialogue 5, an "ironical" attitude was detected by six informants, a "reserved" attitude by three, and a "casual" attitude by one. In dialogue 7, most of the informants (seven listeners) heard an "ironical" attitude, while the rest interpreted the speaker as "reserved".
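As a compact summary, the vote counts reported for the four test dialogues can be tabulated and queried programmatically; a small sketch (counts transcribed from the prose above; the helper `modal_label` is ours, not part of the original study):

```python
# Response counts for the fall-rise test utterance, transcribed from the text.
# Keys: dialogue number -> listener group -> {label: votes out of 10}.
responses = {
    1: {"British": {"reserved": 9, "ironical": 1},
        "Finnish": {"friendly": 4, "casual": 3, "reserved": 2, "joyful": 1}},
    3: {"British": {"reserved": 9, "ironical": 1},
        "Finnish": {"reserved": 8, "bored": 1, "ironical": 1}},
    5: {"British": {"ironical": 6, "reserved": 3, "casual": 1},
        "Finnish": {"friendly": 9, "joyful": 1}},
    7: {"British": {"ironical": 7, "reserved": 3},
        "Finnish": {"friendly": 8, "joyful": 1, "casual": 1}},
}

def modal_label(votes):
    """Most frequent label chosen by a listener group."""
    return max(votes, key=votes.get)

# Each group comprised ten listeners, so every cell must sum to 10.
assert all(sum(group.values()) == 10
           for dialogue in responses.values()
           for group in dialogue.values())
```

The modal labels make the native/non-native split immediate: "reserved"/"ironical" dominate the British responses throughout, while "friendly" dominates the Finnish responses in dialogues 5 and 7.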
Apparently, in dialogues 5 and 7, the Finns
did not perceive any potential conflict with the
lexical meaning and the tone choice: the fall-
rise did not detract from the general positive
attitude expressed verbally. By contrast, the na-
tive speakers were obviously aware of the clash
between lexical meaning and the attitude con-
veyed by intonation. Interestingly, many of the
informants thought that the line was meant to
be ironical: the fall-rise was probably perceived
as being out of place in an otherwise positive
part of dialogue, and the mismatch was attri-
buted to an ironical attitude. The situation here
may be partly similar to the example given by
Watt (1994). If the following utterance is ac-
companied by a smiling voice quality and a
gentle low rising intonation, a mordant effect of
sarcasm or irony is probably created:
Put that goddam pipe away
Incongruous linguistic content and intonation
conspire to produce a stylistic effect which is
likely to irritate or even unsettle the listener. As
Watt points out, intonation has an "interpersonal metafunction" by serving as a channel for linguistic expression of attitude. Lexicon and
(stylistic) register are other channels, and the
interaction of these modes creates attitudinal
meanings of different kinds.
Discussion
The investigation has revealed some interesting
differences in the semantic interpretation of the
fall-rise contour between native and non-native
speakers of English. The results support the
common view that the general pragmat-
ic/semantic meaning of the fall-rise can be de-
scribed in terms of such attitudinal labels as "reserved" and "doubtful": the native speakers systematically associated reservations with the tone when the lexical content was either neutral or congruous with such an interpretation.
If there was a mismatch between words and the
tone, the clash was mainly interpreted as irony.
The British informants apparently had a very
clear idea about where the fall-rise fits in prop-
erly and where it is used for a deliberate pho-
nostylistic effect. The British informants could
thus analyze the meaning of the fall-rise also at
a metalinguistic level.
The Finnish informants mainly resorted to
the so-called lexico-syntactic strategy (see e.g.
Cruz-Ferreira 1986): speakers of a second lan-
guage analyze the (semantic/pragmatic) mean-
ing of an utterance as corresponding to the most
immediate interpretation of the lexical and
grammatical content of the sentence.
The conclusions drawn on the basis of this
investigation are supported by a study of the
productive English intonation skills of Finns:
Toivanen (2001) offers empirical evidence that
Finnish speakers of English very rarely use the
falling-rising tone in conversation. Thus, al-
though Finns can make, phonetically and pho-
nologically, a distinction between falling and
rising intonation, Finns are very hesitant about
associating rising tones with informational
and/or pragmatic openness. In the colloquial
English speech of Finns, reserved or incomplete statements are typically accompanied by falling tones, in contradistinction to the English spoken by native speakers.
Conclusion
This investigation, although based on limited
and somewhat artificial material, suggests that
even very advanced Finnish speakers of English
do not fully master the intonational lexicon of
English. Finns are largely unaware of the prag-
matic meaning of the fall-rise intonation con-
tour, and analyze the tone as conveying reser-
vations only when the lexical meaning allows
for such an interpretation. The British infor-
mants readily perceive the reserved meaning of
the fall-rise. However, if the context suggests
an entirely different semantic interpretation, the
native speakers are likely to conclude that the
contrast between the lexical meaning and the
fall-rise indicates irony of some kind.
Notes
1.
John: What do you think of her?
Bill: She's got a good degree.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
2.
John: What are you doing here?
Bill: Just waiting for Tim. He seems to be late
for the meeting.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
3.
John: Do you think shes qualified for the job?
Bill: I don't know. She's got a good degree. But she hasn't got much experience.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
4.
John: Excuse me, how much is this magazine? There's no price tag on it.
Bill: But there must be a price tag.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
5.
John: My daughter has just graduated from
university. Shes a lawyer now.
Bill: I'm glad to hear that. She's got a good degree.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
6.
John: Can I borrow your car tonight?
Bill: If you really need it I guess you can.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
7.
John: Did you know that my new boss has got a
doctorate in electrical engineering?
Bill: Yes. She's got a good degree. And she's
also a nice person.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
8.
John: Hey, Bill. May I use your cell phone? I
seem to have misplaced mine.
Bill: By all means.
Bill sounds a) friendly, b) reserved, c) bored, d)
joyful, e) casual, f) ironical.
References
Brown G. and Yule G. (1983) Discourse Analy-
sis. Cambridge: Cambridge University
Press.
Cruttenden A. (1997) Intonation. Cambridge:
Cambridge University Press.
Cruz-Ferreira M. (1986) Non-native interpre-
tive strategies for intonational meaning: an
experimental study. In James A. and Leather
J. (eds) Sound Patterns in Second Language
Acquisition, 256-269. Dordrecht: Foris Pub-
lications.
Iivonen A. (1998) Intonation in Finnish. In
Hirst D. and DiCristo A. (eds) Intonation
systems. A survey of twenty languages, 311-
327. Cambridge: Cambridge University
Press.
Roach P. (1991) English phonetics and phonol-
ogy. A practical course. Cambridge: Cam-
bridge University Press.
Toivanen J. (2001) Perspectives on Intonation:
English, Finnish and English Spoken by
Finns. Frankfurt am Main: Peter Lang.
Watt D. (1994) The Phonology and Semology
of Intonation in English. Bloomington, Indi-
ana: Indiana University Linguistics Club
Publications.
Measures of Continuous Voicing related to
Voice Quality in Five-Year-Old Children
Mechtild Tronnier¹ and Anita McAllister²
¹Department of Culture and Communication, University of Linköping
²Department of Clinical and Experimental Medicine, University of Linköping
Abstract
The present investigation pursues the question of whether the perceptual judgement of a child's voice as hoarse can be correlated with the degree of non-periodic sections in the signal, reflecting a lack of regular oscillation of the vocal folds. The results show that this is not the case: children with clearly hoarse voices produce a stable and measurable fundamental frequency. In addition, the recording containing the highest proportion of periodicity is rated as the roughest voice, and children who show a tendency toward an unstable fundamental frequency are not perceptually evaluated as being particularly hoarse.
Introduction
A hoarse voice comprises deviations along several perceptual voice quality dimensions. The prominent parameters are roughness, breathiness and hyperfunction. Ac-
cording to perceptual evaluations, hoarseness in
adult voices is dominated by the feature rough-
ness, followed by breathiness and hyperfunc-
tion. In ten-year-old children however, hyper-
functionality and breathiness are the main per-
ceptual features of a hoarse voice, with minor
contributions of roughness (Sederholm et al.
1993, Sederholm 1995).
Acoustically a rough voice is typically char-
acterised by irregular phonation in the temporal
dimension (jitter) and in the intensity dimen-
sion (shimmer) (Laver, 1980). When character-
ising the voice quality of a healthy speaker with a subtle tendency toward hoarseness, the irregularities are usually very small, but nevertheless present and perceivable. For a more severe degree of hoarseness, the irregularities are more prominent. In the case of hoarseness due to a heavy cold, voicing can even be absent, i.e. no periodic signal is produced during a short stretch of speech. A voice judged as rough is therefore expected to correlate with a smaller scope of periodicity.
As a breathy voice is characterised as weak and ineffective, with incomplete glottal closure leading to air leakage but with otherwise regular phonation, no correlation between the degree of breathiness and the lack of periodicity in the signal is expected.
A hyperfunctional voice is not recognised in terms of high levels of jitter and shimmer, as analyses of perturbation show high regularity in both dimensions (Klingholz & Martin, 1985; McAllister et al. 1998). Spectral measures, such as a low level of the fundamental frequency relative to the first formant, may be more appropriate.
The main question underlying the present investigation is therefore to what extent children's voices judged by professional speech pathologists as predominantly hoarse show an absence of periodicity in the produced speech signal. In addition, the correlation between the proportion of periodicity and the three dominant parameters is taken into consideration.
A further aspect discussed in this investigation is the use of the degree of measurable fundamental frequency as a method of voice stability assessment in children. The intention is to test an alternative to the analysis of perturbation, since experience shows that the available systems are often standardised for adult voices and not suitable for tracking the high fundamental frequency in children's voices.
Material and method
The material investigated in the present study is part of the data gathered for the project "Barn och buller" ("Children and noise"), which is a cooperation between the University of Linköping and KTH, Stockholm, within the BUG project (Barnröstens utveckling och genusskillnader; Child Voice Development and Gender Differences; http://www.speech.kth.se/music/projects/BUG/abstract.html). It consists of recordings from eleven five-year-old children attending three different pre-schools in Linköping. These children were recorded using a binaural technique three times during one day at the pre-school: on arrival in the morning and at gathering, during lunch, and in the afternoon during play time. At the beginning of each recording session the children were asked to repeat the following phrases three times: "En blå bil. En gul bil. En röd bil." ("A blue car. A yellow car. A red car."). The repetitions of these phrases, which consist of voiced phonemes only, form the basis for the current investigation.
In an earlier study these phrase repetitions
were used to perceptually assess the degree of
hoarseness, breathiness, hyperfunction and
roughness by three professional speech pa-
thologists (McAllister et al., in press). Assess-
ment was carried out by marking the degree of
each of the four voice qualities plus an optional
parameter on a Visual Analog Scale (VAS).
The averaged VAS-ratings by the speech pa-
thologists for each individual and each percep-
tual parameter in the recordings were used in
the present investigation.
These recordings were used for an acoustic analysis of sustained voicing. The material was edited so that all pauses and silences not belonging to phonologically voiced parts of the utterance were cut out. An analysis of the fundamental frequency was then carried out in PRAAT with an analysis rate of 100 Hz (one F0 value per 10 ms). The scope of voicing (in %) was obtained as the number of F0 values returned by PRAAT relative to the number of F0 values expected from the length of the utterance.
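The voicing measure can be sketched as follows, assuming the pitch track is exported as one F0 value per 10-ms frame with unvoiced frames coded as 0; a minimal illustration, not the actual analysis script:

```python
def scope_of_voicing(f0_values):
    """Percentage of analysis frames with a measurable F0.

    f0_values: one value per 10-ms frame (100 Hz analysis rate),
    with 0.0 standing for frames where no F0 could be extracted.
    """
    if not f0_values:
        raise ValueError("empty pitch track")
    voiced = sum(1 for f0 in f0_values if f0 > 0)
    return 100.0 * voiced / len(f0_values)

# e.g. 9 voiced frames out of 10 -> 90.0 %
example = [210.0, 215.0, 0.0, 220.0, 225.0, 230.0, 228.0, 224.0, 219.0, 212.0]
assert abs(scope_of_voicing(example) - 90.0) < 1e-9
```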
A Pearson product-moment correlation was computed between the scope of voicing and the average rating of each type of voice quality.
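The Pearson product-moment correlation itself is straightforward to compute; a standard-library sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    assert n == len(ys) and n > 1
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The squared value is the R^2 reported with each trendline below.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
assert abs(pearson_r(xs, ys) - 1.0) < 1e-9
```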
Results
The results are shown in Table 1 and in the
Figures 1 to 4. Table 1 presents the scope of
voicing (in %) for each recording and the aver-
age rating for the comprehensive voice quality
hoarseness and the more specific voice quality
features breathiness, hyperfunction and rough-
ness (on a mm VAS scale). Furthermore, the mean and the standard deviation over the different recordings for the voicing rate and the perceptual voice qualities are presented in the table. Marked cells refer to noteworthy values which are taken up in the discussion.
The figures show the distribution of the
scope of voicing in relationship to different
types of voice qualities. The included trendline
reflects the correlation between the scope of
voicing and the degree of deviating voice qual-
ity. In addition, the squared correlation coefficient (R²), which underlies the trendline, is presented in each diagram.
Table 1. The scope of voicing and the average rating of the different voice qualities, for all recordings.

recording | F0 taken (%) | hoarseness (mm VAS) | breathiness (mm VAS) | hyperfunction (mm VAS) | roughness (mm VAS)
101 | 88.09 | 14.5 | 20 | 13 | 3.5
201 | 86.53 | 15 | 7.5 | 9.5 | 1
301 | 90.65 | 13.5 | 12 | 11 | 4
102 | 94.99 | 7.5 | 15.5 | 3.5 | 1
202 | 98.89 | 24 | 23.5 | 25.5 | 0.5
302 | 93.66 | 12 | 16 | 13 | 1.5
103 | 99.90 | 33.5 | 28.5 | 49.5 | 16
203 | 96.99 | 29 | 15 | 34.5 | 1.5
303 | 93.19 | 38.5 | 38.5 | 37.5 | 0.5
104 | 89.21 | 41.5 | 37.5 | 28.5 | 2
204 | 84.90 | 29 | 21 | 27.5 | 1.5
304 | 98.13 | 51 | 40 | 52 | 1.5
105 | 94.5 | 20.5 | 29 | 14 | 1
205 | 93.7 | 18.5 | 21 | 18.5 | 2
305 | 98.04 | 10 | 13.5 | 10.5 | 1
106 | 84.84 | 26 | 33.5 | 2 | 0
206 | 84.68 | 13.5 | 20 | 12 | 1
306 | 85.79 | 24 | 28.5 | 2 | 1
107 | 88.2 | 53 | 44 | 57 | 1.5
207 | 88.22 | 81 | 64.5 | 76.5 | 6.5
307 | 81.40 | 72.5 | 71 | 79.5 | 5.5
108 | 92.37 | 19 | 18.5 | 9 | 0.5
208 | 94.5 | 14 | 18 | 1.5 | 0.5
308 | 96.17 | 14 | 17.5 | 3 | 0
109 | 90.50 | 21.5 | 23 | 26.5 | 0.5
209 | 97.53 | 23 | 27 | 44.5 | 1.5
309 | 95.19 | 29.5 | 21 | 32.5 | 0.5
110 | 93.62 | 16.5 | 11.5 | 35.5 | 1.5
210 | 95.88 | 21 | 25.5 | 36 | 0.5
310 | 88.05 | 25.5 | 13.5 | 24.5 | 1.5
111 | 65.31 | 29.5 | 29 | 2.5 | 1.5
211 | 82.77 | 26 | 27.5 | 10.5 | 0.5
311 | 78.59 | 28.5 | 26.5 | 3.5 | 0.5
mean | 90.45 | 27.15 | 26.02 | 24.44 | 1.92
sd | 7.11 | 16.71 | 13.87 | 21.03 | 2.91
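As a cross-check, the trendline statistics for hoarseness can be recomputed from the first two columns of the table; a sketch (values transcribed from Table 1, Pearson correlation defined inline so the snippet is self-contained):

```python
import math

# (F0 taken %, hoarseness in mm VAS), transcribed from Table 1.
rows = [
    (88.09, 14.5), (86.53, 15), (90.65, 13.5), (94.99, 7.5),
    (98.89, 24), (93.66, 12), (99.90, 33.5), (96.99, 29),
    (93.19, 38.5), (89.21, 41.5), (84.90, 29), (98.13, 51),
    (94.5, 20.5), (93.7, 18.5), (98.04, 10), (84.84, 26),
    (84.68, 13.5), (85.79, 24), (88.2, 53), (88.22, 81),
    (81.40, 72.5), (92.37, 19), (94.5, 14), (96.17, 14),
    (90.50, 21.5), (97.53, 23), (95.19, 29.5), (93.62, 16.5),
    (95.88, 21), (88.05, 25.5), (65.31, 29.5), (82.77, 26),
    (78.59, 28.5),
]

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [row[0] for row in rows]
ys = [row[1] for row in rows]
r = pearson_r(xs, ys)
# A weak negative correlation, consistent with the small R^2 (about 0.04)
# and the negative trendline orientation reported for Figure 1.
assert r < 0 and r ** 2 < 0.1
```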
Figure 1. Correlation between the scope of voicing (in %) and degree of hoarseness (in mm VAS). (Scatter plot with linear trendline; R² = 0.0439.)
Figure 3. Correlation between the scope of voicing (in %) and degree of hyperfunction (in mm VAS). (Scatter plot with linear trendline; R² = 0.0334.)
Discussion and Conclusions
There is generally a high degree of voicing in all recordings, with no great variation between the different voices and recordings. As the trendlines and the correlation coefficients show, no significant correlation (neither positive nor negative) between the scope of voicing and the degree of perceived voice quality can be found. However, an interesting parallelism concerns the trendlines in Figure 1 and Figure 2, which have a negative slope in both cases. This can be interpreted to mean that the relationship between the scope of voicing and hoarseness is similar to the relationship between the scope of voicing and one of the features of hoarseness, namely breathiness. This parallelism is not found for the other features and shows
Figure 2. Correlation between the scope of voicing (in %) and degree of breathiness (in mm VAS). (Scatter plot with linear trendline; R² = 0.0619.)

Figure 4. Correlation between the scope of voicing (in %) and degree of roughness (in mm VAS). (Scatter plot with linear trendline; R² = 0.0137.)
that breathiness is a stronger feature in making a five-year-old child's voice sound hoarse than the other two features. In an earlier study of ten-
year-old children however, hyperfunctionality
was found to contribute to the perception of
these children as being hoarse as much as the
feature breathiness.
Recordings of the children with the highest level of hoarseness show stable periodicity in the production of fundamental frequency (cf. Table 1, recordings 207 and 307). For those recordings, high ratings of breathiness and hyperfunction were given, but rather low ratings for the feature roughness, which is closely related to irregular periodicity in voices. When
looking more closely at how these children act when participating in the recordings, it is clear that they produce very lively speech, loud and within a wide frequency range. It is also obvious from the more extended material that these children are very active speakers, but their voicing does not fail despite the strong degree of perceived hoarseness.
The child who showed a very low degree of
measured fundamental frequency (cf. Table 1,
recording 111) shows a low degree of rough-
ness, which would be the appropriate feature to
represent aperiodicity. Breathiness is graded as roughly as strong as the degree of hoarseness, though at a moderate level. This child does not behave quite as lively as the children in the recordings mentioned before and speaks in a rather quiet and more monotonous voice, sometimes even whispering. This is the case for the
section included in the investigation material
and also in the following part of the recording,
containing spontaneous speech.
It is also noteworthy that the recording with
the broadest scope of voicing (103) shows the
highest degree of roughness among the re-
cordings. Although the value is relatively low
when compared to the ratings of the other fea-
tures, this stands in contradiction to the assumption that roughness is based on unstable voicing. This voice is also rated as moderately
hyperfunctional, but has a lower rating for the
overall impression of sounding hoarse. The ratings of this recording show that either the idea of using the scope of periodicity, calculated as in this investigation, to account for instability in a child's voice has to be treated with caution, or the relationship between the evaluation of a child's voice as rough and the degree of irregularity has to be studied more closely.
One has to conclude that the degree of absence of measurable fundamental frequency is not related to the type or degree of voice qualities such as hoarseness, breathiness, hyperfunction or roughness in five-year-old children, but rather to behavioural factors. Our conclusion is therefore that this measure does not offer an alternative to the analysis of perturbation.
For a clear picture of what acoustically makes a child's voice sound hoarse, other aspects, e.g. spectral measures, obviously have to be taken into consideration.
References
Klingholz F, and F. Martin (1985). Quantitative
spectral evaluation of shimmer and jitter. J
of Speech and Hearing Research 28, 169-
74.
Laver, J. (1980) The phonetic description of
voice quality. Cambridge University Press.
McAllister, A., Sundberg, J, Hibi S. (1998):
Acoustic measurements and perceptual
evaluation of hoarseness in childrens
voices. Log Phon Vocol, 23, 27-38.
McAllister, A., S. Granqvist, P. Sjlander and J.
Sundberg (in press) Child voice and noise: a
pilot study of noise in day-cares and the ef-
fects on ten childrens voice quality accord-
ing to perceptual evaluation.
Sederholm E., A. McAllister, J. Sundberg and
J. Dalkvist (1993) Perceptual analysis of
child hoarseness using continuous scales.
Scand Journal of Logopedics and Phoniat-
rics 18, 73-82.
Sederholm, E. (1995) Prevalence of hoarseness
in ten-year-old children. Scand Journal of
Logopedics and Phoniatrics 20, 165-173.
Yumoto, E. Y. Sasaki and H. Okamura (1984)
Harmonic-to-noise ratio and psychophysical
measurement of the degree of hoarseness.
Journal of Speech and Hearing Research 27, 2-6.
On final rises and fall-rises in German and Swedish
Gilbert Ambrazaitis
Linguistics and Phonetics, Centre for Languages and Literature, Lund University
Abstract
This study explores the intonational signalling of a request address in German and Swedish. Data from 16 speakers (9 Germans, 7 Swedes) were elicited under controlled conditions, and the intonation contours produced on the test phrase "Wallander?" were classified according to their phrase-final pattern. Both rises and fall-rises were produced frequently by both Germans and Swedes, which is in line with Ohala's frequency code, but challenging for the Lund model of Swedish intonation.
Introduction
The tonal system of Swedish is usually said to
differ largely from that of otherwise closely re-
lated languages such as German, Dutch, or Eng-
lish. One reason for this conception is, of
course, the presence of the tonal word accents
in Swedish, which are absent in the standard
variety of, e.g., German. But the difference be-
tween the intonational systems of German and
Swedish, as they have been described in the lit-
erature, goes far beyond the presence or ab-
sence of lexical tonal phenomena, respectively.
Table 1 displays one example each of phono-
logical accounts of Swedish and German into-
nation: the Lund model for Swedish (Bruce,
1998; 2005), and GToBI for German (Grice et
al., 2005). They have been chosen because both
are contemporary and formulated in terms of
autosegmental-metrical (AM) phonology, i.e.,
they should be formally comparable.
Table 1. Accents and final boundary tones (b.t.) in GToBI for German (Grice et al. 2005) and the Lund model for Swedish (Bruce 1998; 2005).

function | German accents | German b.t. | Swedish accents | Swedish b.t.
lexical | (none) | (none) | H+L*, H*+L | (none)
non-lexical | H*, L+H*, L*, L*+H, H+L*, H+!H* | L-, H-, L-%, L-H%, H-%, H-H% | H- | L%, LH%
According to Table 1, Swedish and German dif-
fer not only with respect to lexical, but also
largely with respect to non-lexical, or utterance-
related, tonal features: While German has six
different accents on the utterance-level, Swed-
ish has only one, known as the focal accent. A
similar relation holds for final boundary tones.
But the conclusion that Swedish has a much
poorer utterance prosody than German may,
of course, only be drawn under the premise that
the two models in Table 1 are (a) adequate and
(b) equivalent, in the sense that they have been
developed under equivalent conditions. How-
ever, it may be argued that Swedish and Ger-
man intonation research are characterized by
different preconditions and traditions to the ex-
tent that a formal comparison even of contem-
porary models does not reveal any reliable in-
formation on actual differences between the in-
tonational systems of the two languages.
This study is part of a larger comparative
project on Standard Swedish and Standard
German intonation, from a communicative-
functional perspective. Its general hypothesis is
that there are more similarities than indicated
by contemporary models (cf. Table 1). The gen-
eral method is to elicit certain utterance types,
or speech acts, defined by constructed (but real-
istic) discourse contexts, in both Swedish and
German, keeping the material, the situational
context, and the recording conditions as con-
stant as possible.
This paper deals with one such utterance type, which may be labelled a request address, as exemplified by "Wallander?" in the following situational context: a police officer from Ystad (southern Sweden) says to his colleague: "Wallander? Would you mind if I asked you for a favour?" The goal of this paper is to gain a
preliminary impression of the intonation pat-
terns used by Germans and Swedes in such re-
quest addresses. For that, a classification of the
obtained intonation contours is undertaken, and
the distribution of patterns, as well as the pho-
netic form of the most frequent patterns, is
compared for Swedish and German. The classi-
fication concentrates on the phrase-final accent
pattern, or the "nuclear tune" in the British tradition, defined as the last (in this study, the only
Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg
one present) pitch accent in an intonation
phrase plus the final boundary tone (cf. next
section).
Phrase-final intonation patterns
For German, a large variety of phrase-final in-
tonation patterns exists according to Table 1.
For the purpose of this study, however, a less
detailed classification will suffice: The nuclear
pattern is either a fall, a rise, or a fall-rise.
A fall has a high stressed or post-stress sylla-
ble and a low boundary tone (e.g., (L+)H* L-%,
L*+H L-%); a rise has a boundary tone higher
than the last accentual tone (e.g., L* H-%, H*
H-^H%); and finally, a fall-rise has a high
stressed or post-stress syllable, and a low-high
sequence as a boundary tone (e.g., H* L-H%).
For Swedish, no such three-fold contrast has
been described. The focal accent H- always in-
volves either a high stressed syllable (words
with accent I), or a high tone later in the word
(words with accent II). Combining this H- with
the two possible boundary tones in Table 1 re-
sults in a fall (H- L%), or a fall-rise (H-
LH%), respectively. That is, a rise (at least
one connected to utterance-level prominence,
cf. discussion), as defined for German above, is
not recognized by the Lund model for Swedish.
Final rises (or fall-rises) are often associated
with the notion of question intonation or with
continuation in a variety of languages. For
German, e.g., question and continuation in-
tonation seem to differ in range and shape of
the rise (Dombrowski and Niebuhr, 2005). Syn-
tactic factors have some influence on whether a
German question is falling or rising, but in gen-
eral, in accordance with Ohala's (1984) frequency code, a rise signals a greater subordination of the enquirer towards the addressee
(Kohler, 2005). Rising intonation may thus
more frequently be found in connection with
polite questions. In Swedish, questions are
typically said not to be marked by final rises
(Gårding, 1979). The rising boundary tone LH% of the Lund model has in fact hardly been discussed from a functional perspective; one function that has been mentioned is the signalling of continuation (Gussenhoven, 2004).
Hypothesis
According to the contemporary descriptions,
Swedish and German intonation patterns should
be expected to differ in the expression of a re-
quest address. Considering a request address
as some kind of polite question, or at least a
function connected with a subordination of the
speaker towards the addressee, one would ex-
pect a rise, or possibly a fall-rise for German.
For Swedish, on the other hand, a fall and a
fall-rise are the only patterns offered by the
Lund model, where the fall-rise is not associ-
ated with question intonation.
Method and materials
German and Swedish subjects were asked to
read test utterances from a computer screen in
an experimental studio at the Humanities Labo-
ratory at Lund University. All utterances consti-
tuted parts of constructed dialogues. For each
test item, a short text describing a situational
context was displayed on the screen, followed
by the test utterance. The speakers were asked to render the test utterance as naturally as possible. Five repetitions of each item were recorded. There were 13 test items in total, and
the whole list of 65 items was randomized. So
far, 7 speakers of Standard Swedish (4 female),
and 9 speakers of Standard German (6 female)
have been recorded.
The test material of this study consists of
one of the 13 items, the one-word phrase Wal-
lander?, both for German and for Swedish. It
constituted the first part of the test utterances "Wallander? Skulle jag kunna få be dig om en tjänst?" (Swedish) and "Wallander? Dürfte ich Sie um einen Gefallen bitten?" (German).
The database of this study consists of all 5 repe-
titions by all 16 speakers, hence 80 renderings
of Wallander?, 35 by Swedish, and 45 by
German speakers.
As a first step in data analysis, all intonation
contours were categorized according to their
phrase-final patterns as described above. The
classification was done manually (by inspect-
ing the F0 contours, auditorily and visually) by
the author. In a second step, differences be-
tween the German and Swedish realizations of
the categories obtained in step 1 were looked
for. For that, each token of Wallander? was
segmented into 5 units corresponding to /(v)a/,
/l/, /a/, /nd/, /de(r)/. The segmentation was done
manually with the help of a spectrogram. All
segments were fully voiced; initial fricatives (as
possible realization of /v/) were, if present, ex-
cluded from the initial segment. The boundary
between /nd/ and /de(r)/ was set immediately
before the plosive burst of /d/. For the purpose
of visual comparison, F0 contours were time-
normalized, by representing each of the 5 seg-
ments by 10 equidistant F0 measurements.
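The two processing steps used for Figure 1 (time normalization with 10 points per segment, and the semitone scale with 0 semitones set to 100 Hz) can be sketched as follows. This is a minimal Python illustration of the procedure described in the text, not the authors' actual script:

```python
import numpy as np

def semitones(f0_hz, ref_hz=100.0):
    # Semitones relative to a reference; Figure 1 sets 0 semitones to 100 Hz.
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def time_normalize(times, f0_hz, boundaries, points_per_segment=10):
    # Represent each segment (here: /(v)a/, /l/, /a/, /nd/, /de(r)/) by a
    # fixed number of equidistant F0 samples, interpolating the raw track.
    out = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        grid = np.linspace(start, end, points_per_segment)
        out.append(np.interp(grid, times, f0_hz))
    return np.concatenate(out)

# Toy example: a level 200 Hz contour over five equal-length segments.
t = np.linspace(0.0, 1.0, 101)
f0 = np.full(101, 200.0)
contour = time_normalize(t, f0, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```

With 5 segments and 10 points each, every token is reduced to 50 comparable F0 values; converting them with semitones() makes relative movements comparable across speakers, as noted in the Results.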
Table 2. Distribution of nuclear intonation patterns
by Swedish and German speakers. N = absolute
number of items; Speak. = speakers who (at least
once) produced a pattern; the first letter in speaker
label indicates sex (M = male; F = female).
            German                        Swedish
            %     N   Speak.              %     N   Speak.
Fall        13.3  6   Mmk; Mas            17.1  6   Fss; Mnh
Fall-rise   17.8  8   Fjd; Fll; Fcf       42.9  15  Fkb; Fcw; Mmr
Rise        66.7  30  Fib; Fjd; Fkm;      25.7  9   Mmu; Mnh
                      Fmt; Fcf; Mms; Mas
Unclear     2.2   1   Fll                 2.9   1   Fek
Other       0.0   0   -                   11.4  4   Fek
Sum         100   45  9                   100   35  7
[Figure 1 plot: x-axis: normalized time (segments (v)a, l, a, nd, de(r)); y-axis: semitones. Legend: German FR (N=8), Swedish FR (N=15), German R (N=30), Swedish R (N=9).]
Figure 1. Average F0 contours in semitones (0
semitones set to 100 Hz) of rises (R) and fall-rises
(FR) by German and Swedish speakers. Time is
normalized (10 data points per segment); the
breaks in the curves indicate segment boundaries;
vertical lines mark the vowel of the stressed sylla-
ble. Observe that the curves are based on produc-
tions from speakers of different sex: German FR
(female); German R (female and male); Swedish
FR (female and male); Swedish R (male).
Results
In most cases, a classification as either fall, rise,
or fall-rise was unproblematic, since these con-
tours were produced rather prototypically.
There were only two cases (one for each lan-
guage), where a decision between fall or fall-
rise was problematic (there was a slight rise of
less than 1 semitone). Four patterns (all by the
same female Swedish speaker Fek) were classified as "other": two were actually falling, but
lacked the typical rising focal accent H-, and
two cases exhibited a high-level monotone
throughout the word.
Distribution of patterns
Table 2 displays the distribution of the patterns
obtained for Swedish and German speakers. It
does, however, not include exact information
on how the N occurrences of a particular pat-
tern are distributed over the speakers listed under "Speak.". Most speakers in fact produced
the same pattern type in all of their 5 repeti-
tions. Only 4 of the German (Mas, Fjd, Fll, Fcf)
and 2 of the Swedish speakers (Mnh, Fek) oc-
casionally produced different pattern types.
Table 2 shows that each of the three nuclear
intonation patterns (fall, rise, fall-rise) was pro-
duced by at least two speakers of each lan-
guage. However, the German speakers most
frequently chose a (simple) rising pattern, while
the Swedes seemed to prefer a fall-rise. But
note that the fall-rise was actually produced by
only 3 of the 7 Swedish speakers, while the
(simple) rise is distributed over 7 of 9 German
speakers. In order to test for an interaction of
language and preference for either a rise or a
fall-rise, the data were re-arranged as follows: If
a speaker had produced pattern X in at least 3 (of 5) repetitions, s/he was classified as an "X speaker". This arrangement is shown in Table 3.
Table 3. Number of German and Swedish speakers
who preferred either a rise or a fall-rise.
German Swedish sum
Fall-rise speakers 2 3 5
Rise speakers 6 2 8
Sum 8 5 13
Fisher's exact test, however, revealed that the
interaction between language and preference for
either a rise or a fall-rise, which is slightly indi-
cated by the data, is not significant (p=.2494).
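For reference, the test can be re-run on the counts of Table 3 with scipy. This is an illustrative re-computation, not part of the original study; note that the reported value p=.2494 is reproduced by a one-sided test on these counts (the two-sided p on the same table is about .29):

```python
from scipy.stats import fisher_exact

# 2x2 table from Table 3: rows = fall-rise vs. rise speakers,
# columns = German vs. Swedish.
table = [[2, 3],
         [6, 2]]

# A one-sided alternative reproduces the p-value reported in the text.
odds, p = fisher_exact(table, alternative='less')
print(round(p, 4))  # 0.2494
```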
Contour shape of rises and fall-rises
Figure 1 displays the mean F0 contours of the
rises and the fall-rises in semitones as produced
by the relevant German and Swedish speakers
(cf. Table 2). Since F0 was not speaker-
normalised and the curves emerge from speak-
ers of different sex, the absolute height of F0
values should be ignored. However, F0 move-
ments, or relative heights, may be compared,
since the F0 measure used is logarithmic.
Both for the rise and for the fall-rise there
appears to be at least one salient difference be-
tween the Swedish and German productions:
As for fall-rises, the final rise spans about 5 semitones for both Germans and Swedes, but
the relative height of the end point differs cru-
cially: it is higher than the accentual F0 peak
for German, but lower than the accentual (fo-
cal) peak for Swedish. Furthermore, the accent
peak appears to be timed somewhat later in the
German compared to the Swedish data. As for
the rises, there is a pronounced tonal step down
from the pre-stress to the low stressed syllable
in the Swedish productions, which is very small
in the German data. There was little variation among the repetitions within each category and language regarding these characteristics.
Discussion
In this study, at least two different pattern types, a rise and a fall-rise, resulted from the elicitation of a function labelled "request address".
Whether these two types represent two equiva-
lent strategies for expressing the same function,
or whether they in fact express different func-
tional nuances that have not been controlled,
will have to be tested in future research.
However, the results indicate that some
form of rise (rise or fall-rise) is the most
frequently occurring final pattern in connection
with a request address, both in German and in
Swedish. This is in line with the frequency code
(Ohala, 1984), which associates high or rising
pitch with the expression of subordination, in
contrast to low or falling pitch, signalling domi-
nance. However, the result challenges the Lund
model, since the rising boundary tone (LH%)
offered by the model has so far not been associ-
ated with the signalling of subordination.
The most salient difference found between
the German and the Swedish data concerns the
step from the pre-stress to the stressed syllable
in the rise patterns, which is very pronounced
in the Swedish, but very small, if not absent, in
the German data. This is in line with the Lund
model, which predicts such an early fall for
accent I when a rising focal accent (H-) is miss-
ing. In fact, the only possibility of the Lund
model to deal with these rises is to describe
them as a non-focal accent I plus rising bound-
ary tone (H+L* LH%).
However, utterances of the type discussed
here (request address) have traditionally not
been within the scope of the Lund model,
which is actually based on statements only.
These may be realized by several prosodic
phrases, and the Lund model assumes that each
of such phrases contains at least one focal ac-
cent (actually, the term phrase accent, as used
earlier, would be more appropriate, since a
statement consisting of two phrases with one
phrase accent each could still have only one
word in narrow focus).
Thus, the Lund model analysis H+L* LH%
for the rise is problematic, since it renders a
phrase lacking any phrase accent. In a com-
parison with other Germanic languages, the
original assumption by the Lund model is plau-
sible, since at least one word in a (non-
interrupted) phrase in, e.g., German and Eng-
lish, is always conceived of as accented (nec-
essarily on the phrase/utterance level). It has
been argued that Swedish, like German, has an
early falling accent on the utterance-level as
well, which is used in the expression of con-
firmation (Ambrazaitis, 2007). The present
Swedish data could also be analyzed in the light
of this earlier finding, i.e., as instances of a low
utterance-level accent, which exists besides the
classical rising focal accent H-.
References
Ambrazaitis G. (2007) Expressing confirmation in Swedish: the interplay of word and utterance prosody. Proc. 16th ICPhS (Saarbrücken, Germany), 1093-1096.
Bruce G. (1998) Allmän och svensk prosodi. Lund: Institutionen för Lingvistik.
Bruce G. (2005) Intonational prominence in varieties of Swedish revisited. In Jun S.-A. (ed) Prosodic Typology: The Phonology of Intonation and Phrasing, 410-429. Oxford: OUP.
Dombrowski E. and Niebuhr O. (2005) Acoustic patterns and communicative functions of phrase-final F0 rises in German: Activating and restricting contours. Phonetica 62, 176-195.
Grice M., Baumann S., and Benzmüller R. (2005) German intonation in autosegmental-metrical phonology. In Jun S.-A. (ed) Prosodic Typology: The Phonology of Intonation and Phrasing, 55-83. Oxford: OUP.
Gussenhoven C. (2004) The Phonology of Tone and Intonation. Cambridge: CUP.
Gårding E. (1979) Sentence intonation in Swedish. Phonetica 36, 207-215.
Kohler K. (2005) Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. AIPUK (Arbeitsberichte IPdS Kiel) 35a, 125-142.
Ohala J.J. (1984) An ethological perspective on common cross-language utilization of F0 of voice. Phonetica 41, 1-16.
SWING:
A tool for modelling intonational varieties of Swedish
Jonas Beskow², Gösta Bruce¹, Laura Enflo², Björn Granström², Susanne Schötz¹ (alphabetical order)
¹Dept. of Linguistics & Phonetics, Centre for Languages & Literature, Lund University
²Dept. of Speech, Music & Hearing, School of Computer Science & Communication, KTH
Abstract
SWING (SWedish INtonation Generator) is a new
tool for analysis and modelling of Swedish in-
tonation by resynthesis. It was developed in or-
der to facilitate analysis of regional varieties,
particularly related to the Swedish prosody
model. Annotated speech samples are resynthe-
sized with rule-based intonation and audio-
visually analysed with regard to the major in-
tonational varieties of Swedish. We find the tool
useful in our work with testing and further de-
veloping the Swedish prosody model.
Introduction and background
Our object of study in the research project
SIMULEKT (Simulating Intonational Varieties
of Swedish, supported by the Swedish Research
Council) (Bruce et al., 2007) is the prosodic
variation characteristic of different regions of
the Swedish-speaking area, shown in Figure 1.
Figure 1. Approximate geographical distribution of
the seven main regional varieties of Swedish.
The seven regions correspond to our present
dialect classification scheme. In our work, the
Swedish prosody model (Bruce & Gårding, 1978; Bruce & Granström, 1993; Bruce, 2007)
and different forms of speech synthesis play
prominent roles. Our main sources for analysis
are the two Swedish speech databases Speech-
Dat (Elenius, 1999) and SweDia 2000 (Engstrand et al., 1997). SpeechDat contains speech
recorded over the telephone from 5000 speak-
ers, registered by age, gender, current location
and self-labelled dialect type. The research pro-
ject SweDia 2000 collected a word list, an elic-
ited prosody material and extensive spontane-
ous monologues from 12 speakers (younger and
elderly men and women) each from more than
100 different places in Sweden and Swedish-
speaking parts of Finland, selected for dialectal
speech.
The Swedish prosody model
The main parameters of the Swedish prosody model are, for word prosody, 1) word accent timing, i.e. timing characteristics of pitch gestures of word accents (accent I/accent II) relative to a stressed syllable, and 2) pitch patterns of compounds, and, for utterance prosody, 3) intonational prominence levels (focal/non-focal accentuation), and 4) patterns of concatenation between pitch gestures of prominent words.
Background
An important part of our project work concerns
auditive and acoustic analysis of dialectal
speech samples available from our two exten-
sive speech databases described in Section 1.
This work includes collecting empirical evi-
dence of prosodic patterns for the regional va-
rieties of Swedish described in the Swedish
prosody model, as well as identifying intona-
tional patterns not yet included in the model.
To facilitate our work with testing and further
developing the model, we needed a tool for
generating rule-based intonation.
Figure 2. Schematic overview of the SWING tool.
Figure 3. Example of an annotated input speech sample.
Design
SWING comprises a number of parts, which are
joined by the speech analysis software Praat
(Boersma & Weenink, 2007), also serving as the
graphical interface. Annotated speech samples
and rules for generating intonation are used as
input to the tool. The tool generates and plays
resynthesis with rule-based and speaker-
normalised intonation of the input speech
sample. Additional features include visual dis-
play of the output on the screen, and options
for printing various kinds of information to the
Praat console (Info window), e.g. rule names
and values, or the time and F0 of generated pitch points. Figure 2 shows a schematic view
of the tool design.
Speech material
The input speech samples are annotated manu-
ally. Stressed syllables are labelled prosodically
and the corresponding vowels are transcribed
orthographically. Table 1 shows the prosodic
labels used in the current version of the tool,
while Figure 3 displays an example utterance with prosodic annotation: "De på kvällarna som vi sänder" ("It's in the evenings that we are transmitting").
Table 1. Labels used for prosodic annotation of
the speech samples to be analysed by the tool.
Label Description
pa1 primary stressed (non-focal) accent 1
pa2 primary stressed (non-focal) accent 2
pa1f focal accent 1
pa2f focal accent 2
Rules
The Swedish prosody model is implemented as a set of rule files, one for each regional variety in the model, with timing and F0 values for critical points in the rules. These files are simply text files with a number of columns, where the first one contains the rule names, and the following columns contain three pairs of values, corresponding to the timing and F0 of equally many critical pitch points of the rules. The three points are called ini (initial), mid (medial), and fin (final). They contain values for the timing (T) and F0 (F0). Timing of F0 points is expressed as a percentage into the stressed syllable, starting from the onset of the stressed vowel. If no timing value is explicitly stated in the rule, the pitch point is by default aligned with the onset of the stressed vowel. Three values are used for F0: L (low), H (high) and H+ (extra high, used in focal accents). The mid
pitch point is optional; unless it is needed by a
rule, its values may be left blank. Existing rules
are easy to adjust, and new rules can be added.
Table 2 shows an example of the rules for South Swedish. Several rules contain a second part, which is used for the pitch contour of the following (unstressed) interval (segment) in the annotated input speech sample. This extra part has "next" attached to its rule name. Examples of such rules are "pa1f" and "pa2f" in Table 2.
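To make the rule-file format concrete, here is a hypothetical reader for such a file in Python. The exact on-disk layout is not specified in the paper, so this sketch assumes tab-separated columns (rule name, then iniT, iniF0, midT, midF0, finT, finF0, with empty fields for unused values, as in Table 2):

```python
def parse_rules(text):
    # One rule per line; an empty field means "no value for this pitch point".
    keys = ("iniT", "iniF0", "midT", "midF0", "finT", "finF0")
    rules = {}
    for line in text.strip().splitlines():
        name, *cols = line.split("\t")
        cols += [""] * (len(keys) - len(cols))   # pad short rows
        rules[name] = {k: v for k, v in zip(keys, cols) if v}
    return rules

def pitch_point_time(vowel_onset, syllable_dur, t_percent=0.0):
    # Timing is a percentage into the stressed syllable, counted from the
    # onset of the stressed vowel; an absent value defaults to the onset (0%).
    return vowel_onset + (t_percent / 100.0) * syllable_dur

# Two rules modelled on Table 2 (hypothetical tab-separated layout).
example = "pa1f\t-10\tL\t20\tH+\t50\tL\nglobal\t\tL\t\t\t\tL"
rules = parse_rules(example)
```

A rule like pa1f then yields three pitch points (L, H+, L) at -10%, 20% and 50% into the stressed syllable, while global contributes only initial and final F0 targets.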
The SWING tool procedure
Analysis with the SWING tool is fairly straight-
forward. The user selects one input speech
sample and one rule file to be used with the
tool, and which (if any) information about the
analysis (rules, pitch points, debugging information) should be printed to the console. A Praat script generates resynthesis of the input speech sample with a rule-based output pitch contour.
Generation of the output pitch contour is based
on 1) the pitch range of the input speech sam-
ple, which is used for speaker normalisation, 2)
the annotation, which is used to find the time
and prosodic gesture to generate, and 3) the rule
file, which is used for the values of the pitch
points in the output. The Praat graphical user
interface provides immediate audio-visual feed-
back of how well the rules work, and also al-
lows for easy additional manipulation of pitch
points with the Praat built-in Manipulation fea-
ture. Figure 4 shows a Praat Manipulation ob-
ject for an example utterance. The light grey line
under the waveform shows the original pitch,
while the circles connected with the solid line
represent the rule-generated output pitch con-
tour. In the Praat interface, the user can easily
compare the original and the resynthesized
sounds and pitch contours, and further adjust
or manipulate the output pitch contour (by
moving one or several pitch points) and the an-
notation files. The rule files can be adjusted in
any text editor.
Testing the Swedish prosody
model with SWING
SWING is now being used in our work with test-
ing and developing the Swedish prosody model.
Testing is done by selecting an input sound
sample and a rule file of the same intonational
variety. If the model works adequately, there
should be a close match between the F0 contour
of the original version and the rule-based one
generated by the tool. Figure 5 shows examples
of such tests of an utterance in the Svea and
South Swedish varieties. Interesting pitch pat-
terns found in our material which have not yet
been implemented in the rules are also analysed
using the tool.
Table 2: Example rule file for South Swedish with timing (T) and F0 (F0) values for initial (ini), mid (mid) and final (fin) points.
Rule name                   iniT  iniF0  midT  midF0  finT  finF0
global (phrase)                   L                          L
concatenation                     L                          L
pa1f (focal accent 1)       -10   L      20    H+     50    L
pa1f next (extra gesture)         L                          L
pa2f (focal accent 2)             L      40    L             H+
pa2f next (extra gesture)         H+     30    L             L
pa1 (non-focal accent 1)    -30   L      10    H      40    L
pa2 (non-focal accent 2)          L      50    L             H
pa2 next (extra gesture)          H      30    L             L
Figure 4. Praat Manipulation display of a South Swedish utterance with rule-generated Svea intonation
(circles connected by solid line; original pitch: light-grey line).
Discussion and future work
Although SWING still needs work, we already
find it useful in our project work of analysing
speech material as well as testing our model.
We consider the general results of our model
tests to be quite encouraging. The tool has so
far been used on a limited number of words,
phrases and utterances and with a subset of the
parameters of the Swedish prosody model, but
was designed to be easily adapted to further
changes and additions in rules as well as speech
material. We are currently including more
speech samples from our two databases, and
implementing other parameters of the Swedish
prosody model, such as rules for compound
words. Our near future plans include evaluation
of the tool by means of perception tests with
natural as well as rule-generated stimuli.
References
Boersma P. and Weenink D. (2007) Praat: doing phonetics by computer (version 4.6.17) [computer program]. http://www.praat.org/, visited 12-Mar-08.
Bruce G. (2007) Components of a prosodic typology of Swedish intonation. In Riad T. and Gussenhoven C. (eds) Tones and Tunes, Volume 1, 113-146. Berlin: Mouton de Gruyter.
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody, 219-228. Lund: Department of Linguistics.
Bruce G. and Granström B. (1993) Prosodic modelling in Swedish speech synthesis. Speech Communication 13, 63-73.
Bruce G., Granström B., and Schötz S. (2007) Simulating intonational varieties of Swedish. Proc. of ICPhS XVI (Saarbrücken, Germany).
Engstrand O., Bannert R., Bruce G., Elert C-C., and Eriksson A. (1997) Phonetics and phonology of Swedish dialects around the year 2000: a research plan. Papers from FONETIK 98, PHONUM 4, 97-100. Umeå: Department of Philosophy and Linguistics.
Elenius K. (1999) Two Swedish SpeechDat databases - some experiences and results. Proc. of Eurospeech 99, 2243-2246.
Figure 5. Original and rule-based intonation of the utterance "De på kvällarna som vi sänder" ("It's in the evenings that we are transmitting") for Svea and South Swedish (original pitch: dotted line; rule-generated pitch: circles connected with solid line).
Recognizing phrase and utterance as prosodic units in
non-tonal dialects of Kammu
Anastasia Karlsson¹, David House² and Damrong Tayanin¹
¹Department of Linguistics and Phonetics, Lund University
²Speech, Music and Hearing, KTH, Stockholm
Abstract
This paper presents a study of prosodic phras-
ing in a non-tonal dialect of Kammu, a Mon-
Khmer language spoken in Northern Laos.
Prosodic phrasing is seen as correlated with
syntactic and informational structures, and the
description is made referring to these two lev-
els. The material investigated comprises sen-
tences of different lengths and syntactic struc-
tures, uttered by seven male speakers. It is
found that, based on prosodic cues, the distinc-
tion between prosodic utterance and prosodic
phrase can be made. Prosodic phrase is sig-
naled by a sequence of low + high pitch while
the right edge of the prosodic utterance gets
low pitch. This low terminal is replaced by a
high terminal in expressive speech. The study is
performed using elicited speech.
Introduction
Kammu, a Mon-Khmer language, has dialects
with lexical tones (low and high lexical tone)
and dialects with no lexical tones. Tones arose
at a late stage of the language's development in
connection with loss of the contrast between
voiced and voiceless initial consonants in a
number of dialects (Svantesson and House,
2006). There are no other phonological differ-
ences between toneless and tonal dialects and
this makes the different Kammu dialects well
suited for studying and comparing the use of
phrase intonation.
In this paper we present an investigation of
prosodic phrasing in the non-tonal dialects of
Kammu. We concentrate on the use of pitch in
signaling prosodic grouping. It is assumed that
a spoken utterance can be prosodically signaled
as one prosodic unit or divided into smaller
prosodic units. We do not have any pre-
assumptions about the types and numbers of
prosodic units in Kammu. Instead we assume
that, given the elicited nature of the material, the
utterances are read as autonomous utterances,
and it is interesting to find out whether the
rightmost utterance boundary is signaled
prosodically differently from the utterance in-
ternal boundaries. Due to its SVO word order,
Kammu typically places new information at the
right edge of the utterance.
Words in Kammu are monosyllabic or ses-
quisyllabic (Svantesson and Karlsson, 2004).
Sesquisyllabic words consist of one minor and
one major syllable. The minor syllable has
schwa as its nucleus. Schwa insertion is often
absent in casual speech, but appears consis-
tently in some types of singing (Lundstrm and
Svantesson, 2008). There is also a phonological
distinction between short and long vowels.
Method
Material
Material was collected in Laos in 2007. For the
present investigation 16 sentences with differ-
ent lengths and different syntactic structures
were chosen. Kammu lacks a written script and
informants were asked to translate the material
from Lao to Kammu. Kammu speakers are bi-
lingual with Lao being their second language.
This resulted in somewhat different but still
compatible versions of the target sentences. The
resulting utterances were checked and tran-
scribed by Damrong Tayanin who is a native
speaker of Kammu.
Each target sentence was read three times by
the speakers. A total of 226 utterances were in-
vestigated.
Subjects
A total of nine speakers, seven men and two women, were recorded. Their ages ranged from
14 to 57 years. For the present investigation
seven speakers, all men, were chosen. They are
labeled as S1, S3, S5, S6, S7, S8 and S9.
Recording and analysis
The subjects were recorded with a portable Edi-
rol R-09 digital recorder. Five of the speakers
were recorded in quiet hotel rooms, S3 was re-
corded in his native village, and S7 was re-
corded at his home.
The recordings were analyzed using the
Praat program. For each utterance, an f0 contour was extracted. Main pitch features such as
turning points, lows and highs, relations be-
tween lows and highs specified by finding the
lowest and the highest point in the f0 contours,
and shapes of pitch gestures (fall, rise or level
tone) were measured.
The observed tonal features were analyzed
by referring to the syntactic and information
structure of the utterances. Thus, the division of
the sentences into syntactic phrases (NP, VP
etc) and types of words (function- or lexical
words) were matched to the tonal events ob-
served. The placement of pragmatic focus was
unambiguous from the sentence contents.
Results
Prosodic utterance
Regarding the signaling of the right utterance
edge, the speakers can be divided into two
groups. Speakers in the first group (S1, S5, S7,
S9) have high pitch gestures utterance finally.
Speakers in the second group (S3, S6, S8) have
both low and high terminals. In the second
group, S3 and S6 show a prevalence of low terminals, while S8 has mostly high terminals.
In our previous investigation (Karlsson et
al., 2007), we found two main types of focal
accent in the non-tonal dialects. The material
comprised recordings made by Kristina Lindell
in the 1970s of one male speaker telling a folk-
tale. Based on the contents, we assumed that the
high focal accent was more expressive than the
low one. The present material gives more sup-
port to our assumption about the different prag-
matic load of the focal accents. Thus, we find
two different gestures, a falling pitch and a high
pitch in the same position in the utterance. This
is a default position for focus and both gestures
thus function as focal accent. The high focal
accent differs from the low one by its expres-
sive load. For instance, S3 is reserved when
reading the written material. He uses high ter-
minals only in some cases. On the other hand,
in his spontaneous storytelling, S3 is very re-
laxed and uses high terminals instead of low
ones.
A default pattern of a neutral utterance ut-
tered as one prosodic unit is a declining f0 course with a low (falling) terminal. The low
terminal tone signals the right boundary of a
prosodic utterance. A prosodic phrase cannot have a low terminal (more on this in the next section), unless the phrase can (syntactically and semantically) function as an autonomous utterance.
Speakers with high terminals usually reach
the highest f0 value in the utterance at the final
rise. This is typical even for utterances with
multiple focal accents, such as listings of ob-
jects that a person owns. Figure 1 illustrates
variants of the sentence hmra o d wt, traak o d wt "A horse I'll buy and a buffalo I'll buy". In the top panel the f0 course of S3 is
presented. The sentence is realised as two pro-
sodic utterances with low terminals. The utter-
ance boundaries are shown with arrows. In the
bottom panel, the f0 course of S1 is presented.
This speaker is expressive and has high bound-
ary tones. He uttered the sentences as consisting
of four prosodic phrases, each ending with a
high tonal gesture (shown with arrows).
Figure 1. F0 courses of the sentence (in its variant renderings) hmra o d wt, traak o d wt "A horse I'll buy and a buffalo I'll buy", uttered by S3 (the upper panel) and S1 (the bottom panel).
Prosodic phrase
When the utterance is divided into smaller pro-
sodic units (named prosodic phrases here), it is
signaled by a combination of low + high pitch.
The basic prosodic phrase comprises two words
with the first word getting low pitch and the
second word getting high pitch. These two-
word groups overlap with the syntactic group-
ing in the sense that prosodic grouping does not
pose boundaries which are syntactically impos-
sible. The prosodic phrasing pattern low + high
can be supposed to reflect the placement of the
focal accent at the right edge of a focused unit.
This is, however, a subject for future study. The
right boundary of a prosodic phrase is signaled by a high pitch, and the right boundary of a prosodic utterance gets the highest f0 value in expressive speech. Figure 2 shows f0 courses of the utterance , o phaan hmra 'Yes, I'll kill the horse' (top panel) and of the utterance o phaan hmra too mee wt knaay 'I'll kill the horse that you bought' (bottom panel), Speaker 9 (an expressive one). Both utterances are divided into three prosodic phrases; their final high tonal gestures are shown with arrows. The word 'horse' is utterance final in the top panel and gets the highest f0 values. In the bottom panel the same word is in utterance-medial position and does not get the highest f0 values; it is the utterance-final word that in most cases gets the highest f0 value. The high pitch on 'horse' in the first case is the utterance-final high, while in the second case it is a phrase-final high.
Figure 2. f0 courses of the utterance , o phaan hmra 'Yes, I'll kill the horse' (top panel) and of the utterance o phaan hmra too mee wt knaay 'I'll kill the horse that you bought' (bottom panel), S9.
The basic phrasal pattern low + high can be modified by a number of factors. Thus, when more than two words are included in a prosodic phrase, the non-final words get low pitch and only the last one is marked with high pitch. One-word phrases occur; they get only high pitch, and there are no prosodic phrases with only low pitch. A one-word phrase is typically the utterance-initial word; thus the pronoun o 'I' is often phrased as a one-word phrase. Adverbials placed utterance-initially, being syntactically self-sufficient units, are always marked by a high pitch and phrased as a one-word phrase. Figure 3 illustrates the sentence sgii g waar 'Today it is sunny' uttered as sgii waar by S1. Both words get high pitch, and the utterance is thereby divided into two prosodic phrases.
Figure 3. f0 course of the utterance sgii waar 'Today it is sunny' uttered by S1.
Clusters of two high tones within the same prosodic phrase occur, but only in utterance-final position. This happens when the last two words are pragmatically highlighted. The pattern low + high is the preferred one, and no more than two high tones are found in the same prosodic phrase.
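The phrasal tone assignment just described (non-final words low, the final word high, with a second high only when the last two words are pragmatically highlighted) can be sketched as a toy rule. This is our own illustrative model, not an analysis tool used by the authors; the function name and tone labels are invented for the example:

```python
def assign_phrase_tones(words, highlight_last_two=False):
    """Assign Kammu-style phrasal pitch targets: every non-final
    word is low ('L'), the phrase-final word is high ('H').
    If the last two words are pragmatically highlighted, both get
    'H' (the only context in which two highs co-occur)."""
    n = len(words)
    tones = []
    for i, w in enumerate(words):
        if i == n - 1:
            tones.append((w, 'H'))      # phrase-final high
        elif highlight_last_two and i == n - 2:
            tones.append((w, 'H'))      # optional second high
        else:
            tones.append((w, 'L'))      # all other words are low
    return tones
```

Note that a one-word phrase then receives only a high pitch, matching the observation that no prosodic phrase carries only low pitch.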
Final word
In elicited Kammu speech the division into
smaller and larger prosodic units, the prosodic
phrase and the prosodic utterance, is clearly sig-
naled by prosodic means. The next step is to
test this description against spontaneous speech. In a previous study (Karlsson et al., 2007) we could not find any prosodic cues to discriminate between the prosodic phrase and the prosodic utterance and chose to operate only with the prosodic phrase. Given that a prosodic phrase cannot have a low tone finally, the previous material could be reanalyzed.
Acknowledgements
This work has been carried out within the research project 'Separating intonation from tone' (SIFT), supported by The Bank of Sweden Tercentenary Foundation.
References
Karlsson A.M., House D., Svantesson J-O. and Tayanin D. (2007). Prosodic phrasing in tonal and non-tonal dialects of Kammu. Proceedings of the 16th International Congress of Phonetic Sciences, 1309–1312.
Lundström, Håkan & Svantesson, Jan-Olof. (2008). Hrl singing and word-tones in Kammu. Working Papers 53: 117–131. Lund University: Dept of Linguistics and Phonetics.
Svantesson, J-O., & House, D. (2006). Tone production, tone perception and Kammu tonogenesis. Phonology 23, 309–333.
Svantesson, J-O & Karlsson, A.M. (2004). Minor Syllable Tones in Kammu. Proceedings of International Symposium on Tonal Aspects of Languages, Beijing 28–30 March, 177–180.
Phonological association of tone. Phonetic implications in West Swedish and East Norwegian
Tomas Riad¹ and My Segerup²
¹Department of Scandinavian languages, Stockholm University
²Centre for languages and literature, Lund University
Abstract
This paper looks for preliminary phonetic evidence in support of a phonological difference between tonal structure in West Swedish (Göteborg) and East Norwegian (Oslo) compounds that occasions a distributional difference between tonal accents in the two dialects (Riad 1998, 2006). We propose that the chief difference concerns the phonological association of the so-called prominence L tone in the HLH% sequence. We found a tendency for the L minimum to occur further to the left in East Norwegian than in West Swedish, in accordance with the prediction.
Introduction
The intonational affinity between West Swedish and East Norwegian is clear (Gårding & Stenberg 1990). The realization of tone accent is melodically very similar. In citation form an accent 2 word contains a HLH sequence, where the first H (lexical or postlexical) tone occurs on the main stressed syllable and the second H (boundary) tone is aligned with the right edge of the word. The second H marks the boundary of the intonation phrase, and also carries some of the focus function (Fretheim & Nilsen 1989). On the face of it, one might think that the two dialects are quite similar, and indeed they are categorized together in Gårding's Scandinavian accent typology (1977). In fact, Gårding and Stenberg (1990) found that the two dialects are quite similar, apart from the pitch height in conjunction with the boundary H, which is markedly higher in East Norwegian than in West Swedish.
Nevertheless, there are a couple of facts that
signal a bigger difference than meets the ear,
and the point of this pilot study is to phoneti-
cally follow up on the proposal made for pho-
nology in Riad (1998, 2003, 2006), and to pre-
pare for a more thorough investigation.
Background
The chief reason to suspect that there is a grammatical difference in tonal structure between the two dialect areas is the distribution of tone accent in compounds and other structures containing more than one phonological stress. In Swedish dialects, except South Swedish, and in Norwegian dialects north of Trøndelag, pretty much all regular compounds get accent 2. In these cases, accent 2 is assigned postlexically, by virtue of the prosodic constellation of two stresses within the structure. In the western and southern Norwegian dialects and in South Swedish, both accents occur in compounds and compound-like structures. In these cases, the assignment of tone accent is influenced by a combination of lexical, morphological and prosodic factors (cf. Withgott & Halvorsen 1988, Kristoffersen 1992 for Norwegian; cf. Bruce 1974, Delsing & Holm 1988, Riad 1998, 2003, 2006 for Swedish). This difference occasions an isogloss that cuts through the Scandinavian peninsula dividing dialects into either type. East Norwegian and West Swedish are on either side of it (for maps, cf. Riad 2003:125, 2005:23, 2006:40).¹
Hypotheses from phonology
In Riad (1998) it is proposed that the isogloss
marks a difference in the tonal alignment of the
prominence tone in compounds and com-
pound-like structures. We use the term promi-
nence tone in a function-neutral way here. It
simply denotes the tone that follows the lexi-
cal/postlexical tone of accent 2. In the dialects
at hand, it is the L tone between the two H
tones. According to the original proposal (the
alignment hypothesis), the difference lies in
left-alignment of L in East Norwegian and left-
and-right-alignment of L in West Swedish.
Thus, in East Norwegian there is interpolation
from L to the final H boundary tone, whereas in
West Swedish the L tone spreads between the
H tones, occasioning a tonal floor.
In this article, we assume an alternative hypothesis (put forward in Riad 2008) where the claim is that the difference between the types is really one of association (the association hypothesis). The phonological claim, then, is that general accent 2 in compounds and similar structures follows from tonal association to both the initial primary stress and the rightmost secondary stress. This is what we find in the eastern-northern area of Scandinavia. Conversely, in dialects that allow either tonal accent in compounds, there is only one association point, namely the initial primary stress. This is what we find in the western-southern area. The prediction for these latter dialects, then, is that they instantiate tone accent in much the same way in both simplex forms and compounds.²
The representational difference between the two dialects is illustrated in Figure 1.
Göteborg
H LH%
| |
sommar-ledig-heten sommar-ledig-heten
Oslo
H L H%
|
sommar-ledig-heten sommar-ledig-heten
Figure 1. Schematic representations of the compound sommarledigheten 'the summer vacation' in Göteborg³ and Oslo. The contour is accent 2. The prominence tone is underscored (L). The stylized contours to the right give an idea of how these representations are expected to come out melodically.
Predictions and expectations
As far as this pilot study is concerned, the chief prediction concerns East Norwegian. We expect the lowest point between the two H tones to be to the left rather than to the right. Just as in simplex forms in several other dialects, the L prominence tone should follow the initial H directly. From that point, there should be interpolation to the final H%, since the L tone neither spreads nor associates from or to any point further to the right. Hence, we expect the L pitch minimum (henceforth Lmin) to be leftward.
In West Swedish, the expectation under both the alignment and association hypotheses is that we should see an intonation floor (low plateau) between the H tones. This makes no particular prediction regarding the location of the lowest point. In principle, it could quite possibly occur to the right. One possible difference between the alignment and association hypotheses is that Lmin should not occur further to the right than the last stress under the association hypothesis.
At a general level, then, we expect the pitch contour between the H tones to be flatter in West Swedish than in East Norwegian.
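The leftward/rightward notion can be made concrete as a normalized position of Lmin between the two H points. This operationalization is our own, for illustration only; the study itself assesses the position by inspection and annotation:

```python
def lmin_position(t_h1, t_lmin, t_h2):
    """Normalized position of Lmin between the two H points:
    0.0 = at the initial H, 1.0 = at the final H%.
    Values well below 0.5 indicate a leftward Lmin (the East
    Norwegian prediction); values above 0.5, a rightward one."""
    if not t_h1 < t_lmin < t_h2:
        raise ValueError("Lmin must lie between the H points")
    return (t_lmin - t_h1) / (t_h2 - t_h1)
```

For example, an Lmin at 0.75 s between H points at 0.0 s and 1.0 s yields 0.75, i.e. a rightward minimum.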
Method
In order to find preliminary support for a difference regarding the association/alignment of the L tone it is a good idea to look at long compounds. The longer the compound, the greater the opportunity for an unassociated contour to rise in East Norwegian. Conversely, if the last stress remains a relatively low point also in long West Swedish compounds, then that is an indication of association.
We have excerpted a number of compounds from four speakers in local radio programs in Göteborg (3 speakers) and Oslo (1 speaker). Excerpted forms were all in focus position (medial or final) such that they clearly contained the HLH contour within the compound. Our pitch analysis is carried out by means of Praat (Boersma & Weenink 2001). We marked the H points for each excerpted compound in the sound object window and then identified the lowest point between them. This can be conveniently done by combining visual inspection with the 'move cursor to maximum/minimum pitch' function in Praat. The lowest point was annotated Lmin on the point tier, and we also marked one more L point (outside of the syllable containing Lmin), so as to get a reference point.
With these marks in place we get an idea of whether the Lmin is rightward or leftward. Also, by comparing the low points we get an idea of the flatness of the floor or the steepness of the rise.
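The Lmin search between two marked H points can be sketched as follows. This is only an illustration of the same idea, not the procedure as carried out in Praat's interface; we assume the pitch track is available as (time, f0) pairs with unvoiced frames removed:

```python
def find_l_min(pitch_track, t_h1, t_h2):
    """Return the (time, f0) sample with the lowest pitch strictly
    between two H points marked at times t_h1 and t_h2.
    pitch_track: list of (time_s, f0_hz) pairs, unvoiced frames omitted."""
    between = [(t, f) for (t, f) in pitch_track if t_h1 < t < t_h2]
    if not between:
        raise ValueError("no voiced samples between the H points")
    # lowest f0 value wins; ties resolve to the earliest sample
    return min(between, key=lambda tf: tf[1])
```

A second, reference L point outside the Lmin syllable can then be compared against the returned value to gauge the flatness of the floor.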
Results
Pending permission, we cannot publish any of the pitch contours from the radio material. Instead, we present illustrative recordings of a Göteborg speaker in Figure 2 and of an Oslo speaker in Figure 3.
[Pitch contour; syllables: ek sem pel me ning ar; tones: H Lmin L H]
Figure 2. Göteborg compound exempelmeningar, with Lmin in the last stressed syllable.
[Pitch contour; syllables: sam ferd sels by råd; tones: H L Lmin H]
Figure 3. Oslo compound samferdselsbyråd, with Lmin in the second syllable.
In Tables 1 and 2, we give the list of the investigated compounds. In each compound, the syllable containing Lmin is underscored.
Table 1. Oslo. 9 tokens, one female speaker (FAn).
word (Lmin underscored), gloss
granskingsrapportene 'the scrutiny reports'
fremskrittspartiet a 'the progress party'
fremskrittspartiet b 'the progress party'
ordførerspørsmålet 'the president question'
samferdselsbyråd a 'communications councillor'
samferdselsbyråd b 'communications councillor'
arbeidstilsynet 'the work inspection'
holdeplassene 'the stations'
hestedrosjene 'the horse taxis'
Table 2. Göteborg. 9 tokens, three male speakers (MAg 3, MBg 2, MCg 4).
word (Lmin underscored), gloss
fotbollsfamiljen 'the football family'
seniorsidan 'the seniors side'
fotbollsälskare 'football lovers'
dagsläget 'day form'
sidosatta 'set aside'
exempelmeningar 'sentence examples'
betydelseassociationen 'meaning associations'
meningsbyggnad 'syntax'
obeslutsamme 'indecisive'
Discussion
As seen, there is a clear tendency for the Lmin to occur to the left in the Oslo data. Lmin is also quite rightward, as a tendency, in the Göteborg data. In Oslo, the tone starts to rise immediately after the L. In long compounds containing several stresses, too, it keeps rising past the final stress. If there were an association point here, we would expect final stresses to be L. In Göteborg Swedish we find that the final stress is invariably low, and whether or not it is the Lmin, there is clearly a tonal floor between the H points. Thus, we conclude that it is worthwhile pursuing the association hypothesis with a fuller investigation.
Conclusion and prospects
One of the important tasks of phonology is to generate hypotheses for phonetics. In this pilot study, we have begun to follow up on such hypotheses regarding the tonal phonology of compounds in West Swedish and East Norwegian. In the follow-up investigation we plan to make our own recordings and see if the findings can be further substantiated. Regarding Göteborg, we hope to be able to clearly separate the predictions of the alignment hypothesis from those of the association hypothesis, by studying long compounds with the final stress at different distances from the right edge. Regarding Oslo, we hope to show that the last (though not necessarily final) stress in long compounds may be integrated into the final rise. We expect that using both normal and loud speaking mode as in Segerup (2004) will be a good way of bringing out the prosodic profile of the structures investigated. As mentioned, East Norwegian also has tone accent 1 in compounds. In those cases the L is associated to the main stressed syllable, and we expect the rise to begin even earlier, and the following stresses to be integrated into the rising slope.
Acknowledgements
We wish to thank Sara Myrberg for comments on the text and Per Olav Heggtveit for providing us with the illustrative example of an Oslo compound for this article.
Notes
1. Note that the isogloss as such is independent of melodic differences regarding tone value.
2. The implications of the two hypotheses are less important for this pilot study, but they will be heeded in the follow-up study, where controlled materials will be recorded.
3. In Göteborg, the H% tone is often followed by a slight fall (Gårding & Stenberg 1990, Kuronen 1999). Since the H% is very clearly in the final syllable, and we focus on the L tone before it, we will disregard the very end of the contour.
References
Boersma, Paul & David Weenink. 2001. Praat: Doing phonetics by computer. <http://www.fon.hum.uva.nl/praat/>
Bruce, Gösta. 1974. Tonaccentregler för sammansatta ord i några sydsvenska stadsmål, in: Platzack, C. (ed.): Svenskans beskrivning 8, 62–75.
Delsing, Lars-Olof & Holm, Lisa. 1988. Om akut accent i skånskan, in: Sagt och skrivet. (Festskrift till David Kornhall.) Institutionen för nordiska språk, Lunds universitet.
Fretheim, Thorstein & Randi Alice Nilsen. 1989. Terminal rise and rise-fall tunes in East Norwegian intonation. Nordic Journal of Linguistics 12, 155–181.
Gårding, Eva & Michaël Stenberg. 1990. West Swedish and East Norwegian intonation, in: Kalevi Wiik & Ilkka Raimo (eds.), Nordic Prosody V, 111–130. Turku: Dept. of Phonetics, University of Turku.
Gårding, Eva. 1977. The Scandinavian word accents. Lund: CWK Gleerup.
Kristoffersen, Gjert. 1992. Tonelag i sammensatte ord i østnorsk, Norsk lingvistisk tidsskrift 10, 39–65.
Kristoffersen, Gjert. 2000. The Phonology of Norwegian. (The phonology of the world's languages.) Oxford University Press, Oxford.
Kuronen, Mikko. 1999. Prosodiska särdrag i göteborgska, in: Svenskans beskrivning 23. Lund: Lund University Press, 188–96.
Riad, Tomas. 1998. Towards a Scandinavian accent typology, in: Kehrein, W. & Wiese, R. (eds.) Phonology and Morphology of the Germanic Languages, 77–109. (Linguistische Arbeiten 386) Niemeyer, Tübingen.
Riad, Tomas. 2003. Diachrony of the Scandinavian accent typology, in: Fikkert, P. & Jacobs, H. (eds.) Development in Prosodic Systems (Studies in Generative Grammar 58). Berlin/New York: Mouton de Gruyter. 91–144.
Riad, Tomas. 2005. Historien om tonaccenten, in: Falk, Cecilia & Lars-Olof Delsing (eds) Svenska språkets historia 8. Studentlitteratur, Lund. S. 1–27.
Riad, Tomas. 2006. Scandinavian accent typology. Sprachtypol. Univ. Forsch. (STUF), Berlin 59 (2006) 1, 36–55.
Riad, Tomas. 2008. "Börk börk börk. Ehula hule de chokolad muus". Språktidningen nr 2, 2008, 34–39.
Segerup, My. 2004. Gothenburg Swedish word accents: a fine distinction, in: Branderud, P. & H. Traunmüller (eds). Proceedings Fonetik 2004 (Department of Linguistics, Stockholm University), 28–31.
Withgott, Meg & Halvorsen, Per-Kristian. 1988. Phonetic and phonological considerations bearing on the representation of east Norwegian accent, in: van der Hulst, H. & Smith, N. (eds.): Autosegmental studies on pitch accent, 279–294. Foris, Dordrecht.
Vowels in rural southwest Tyrone
Una Cunningham
Department of Arts and Languages, Högskolan Dalarna, Falun
Abstract
This study aims to pin down some of the phonetic variation and oddities associated with Northern Ireland English (NIE) in general and the English of rural southwest Tyrone (ERST) in particular. Vowel quality and vowel quantity relationships are crucial here. ERST may have short or long vowels, depending on factors that are not phonologically interesting in other varieties of English. Vowel shifts from Middle English are only partly carried through, leading to sociophonetic variation.
Northern Ireland English
The Northern Irish English (NIE) accent is
quite distinctive in many ways. It is an accent
that is noticed outside of Northern Ireland, and
one that has often been generally stigmatised in
other parts of the UK. However there is a good
deal of variation within Northern Ireland. It is
well documented that the accents spoken in dif-
ferent parts of the province reflect different
combinations of the main accent forces that op-
erate. The peculiarities of the history of Ireland,
in particular the Plantation of Ulster in the sev-
enteenth century and the shift from Irish to English from the latter half of the nineteenth century, have left their mark on the way English is spoken in different parts of Ulster to the present day.
Southwest Tyrone
In a band stretching across Ulster from Belfast
to Donegal the dialects spoken in the Republic
of Ireland meet the Ulster Scots of the north-
ernmost counties in what is known as Mid-Ulster English (MUE), which has been found to share features of both dialects (Harris 1985).
Tyrone is one of the southernmost counties in
Northern Ireland. The varieties of Mid-Ulster
English found in Southwest Tyrone are particu-
larly broad, representing a variation between
older forms and newer ones. Rural speakers are
generally expected to be more conservative
than urban ones.
One of the most prominent features of NIE is the unusual timing conditions that hold between long and short vowels. What Harris (1985) refers to as Aitken's Law, and McCafferty (2001:133) refers to as the Scottish Vowel Length Rule, formulated to account for vowel length in Scots dialects, is said to apply here. This means that in certain phonetic environments vowels that in RP would be half long, such as [ɛ] in bed, are pronounced with a long vowel, while vowels that would in RP be pronounced with a long vowel are pronounced with a noticeably short vowel, e.g. [fʉd] food. The particular conditions of quantity in ERST will be documented here.
The phonological system of ERST, along with other varieties of NIE, is not entirely identical to that of RP. As in Scots, the /uː/–/ʊ/ distinction is not upheld. This is not a very linguistically useful distinction, so very little communicative information is lost. Other distinctions are made that are not made in RP, such as between horse and hoarse (Wells 1982). In some cases there is variation between two vowel qualities, noticeably in words like pull, which are found as [pʌl] (stigmatized) and [pʉl]. The first vowel of words like comfort can be either of these in some speakers.
Material
Unlike many dialectological studies, which focus on elderly rural speakers, this study examines the vowels of young speakers. Two brothers, aged 8 and 14, and their sisters, aged 10 and 18 at the time of recording, were asked to read a wordlist. The wordlist includes examples of all the phonemes of RP and has a number of key phonetic environments for the high front vowels /iː, ɪ/ in particular. This was part of a larger material, including texts and spontaneous speech. Recordings were made using a Zoom H4 digital recorder.
Vowel quantity
McCafferty (2001) accounts for the quantity conditions upheld in (London)Derry English, another variety of Mid-Ulster English spoken in Derry city, which is about 40 km from Southwest Tyrone. There are almost no phonemic vowel length distinctions, but phonetic lengthening is activated in certain environments (see Table 1 for a description of the situation in Belfast vernacular, another variety of Mid-Ulster English spoken about 80 km from Southwest Tyrone). According to Aitken's Law (Aitken 1981), vowel length is conditioned by the phonetic environment after the vowel. This process may happen alongside the more general enhanced fortis clipping that can be found in most or all varieties of English.
Table 1. Aitken's Law as applied to Belfast vernacular, another variety of Mid-Ulster English, after Harris (1985:43).

        /i/     /e/     /ɛ/
Long:
_#      see     day     -
_z      breeze  daze    Des
Short:
_n      keen    rain    pen
_d      seed    fade    dead
_s      geese   face    mess
_t      feet    fate    pet
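The conditioning in Table 1 can be restated as a small lookup over the post-vocalic context. This is only a sketch of the table, not a complete statement of Aitken's Law (which also covers further environments such as other voiced fricatives and morpheme boundaries); the context coding is our own:

```python
def svlr_length(following):
    """Predict vowel length in Belfast vernacular after Table 1
    (Harris 1985:43): long word-finally ('#') and before a voiced
    fricative ('z'); short before 'n', 'd', 's', 't'."""
    long_contexts = {'#', 'z'}
    short_contexts = {'n', 'd', 's', 't'}
    if following in long_contexts:
        return 'long'
    if following in short_contexts:
        return 'short'
    raise ValueError(f"context {following!r} not covered by Table 1")
```

So see (word-final) comes out long, while keen, seed, geese and feet come out short, as in the table.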
Looking then at the results obtained by the informants for the vowel /iː/ in Aitken's long (see, leaves, trees, believe) and short (green, feel, sheep) contexts, we find the following, shown speaker by speaker, with a female Estuary English (EE) speaker as a control.
Figure 1. Average vowel length for the vowel /iː/ in Aitken's 'long' (see, leaves, trees, believe) and 'short' (green, feel, sheep) contexts.
So what we see here is that there is a considerable difference between the vowel length in the long condition and the short condition for all four of the siblings in the study. This would seem to support the assumption that Aitken's law should apply in ERST as a variety of MUE. However, the definition of the long and short contexts overlaps partly with the distinction between fortis and lenis postvocalic consonants. All of the long contexts are those in which the enhanced length difference between vowels preceding fortis and lenis consonants would also lead to the vowel being long. The short contexts, however, have both fortis and lenis postvocalic consonants. The control speaker, a 50-year-old female Estuary English speaker, also had a considerable difference between vowel length in long and short contexts, but the difference is smaller. It appears that her long /iː/ is about as long as the ERST /iː/, but that her short vowels are not as short as those of the ERST speakers.
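The speaker-by-speaker averages behind figures of this kind can be computed with a simple grouping sketch. This is our own illustration, not the analysis script used in the study; the speaker codes and durations in the usage example are invented:

```python
def mean_durations(tokens):
    """Average vowel duration (ms) per (speaker, context) group.
    tokens: iterable of (speaker, context, duration_ms) triples,
    where context is 'long' or 'short' in Aitken's sense."""
    sums, counts = {}, {}
    for speaker, context, dur in tokens:
        key = (speaker, context)
        sums[key] = sums.get(key, 0.0) + dur
        counts[key] = counts.get(key, 0) + 1
    # one mean per speaker/context cell, as plotted in Figures 1-2
    return {key: sums[key] / counts[key] for key in sums}
```

For example, three tokens from a hypothetical speaker 'b8' with durations 200, 220 (long) and 100 (short) yield means of 210 ms and 100 ms for the two conditions.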
So what happens then for vowels that are short in RP and other accents? Consider the case (using the notation system for varieties of English developed in Wells 1982) of the DRESS vowel /ɛ/. According to Aitken's law, this vowel will be long in certain postvocalic contexts, such as bed, and short in others, such as get. Unfortunately the recorded material does not have many word-list versions of this vowel. For the FACE vowel (Wells 1982), /e/, which is a monophthong in ERST and other varieties of NIE, there is data, however. The words day (long) and places and great (short) can serve as examples of the way this length condition works in ERST. Again, an EE speaker serves as control.
Figure 2. Average vowel length for the vowel /e/ in Aitken's 'long' (day) and 'short' (places, great) contexts.
Here the difference between the EE speaker
and the ERST speakers is less obvious from the
figure, but the fact that the ERST long vowel is
monophthongal makes it quite prominently
long.
Vowel quality
In NIE in general, there are a number of prominent characteristics of the vowel inventory. One is that RP's /uː/ and /ʊ/ (Wells 1982's GOOSE and FOOT) merge to /ʉ/ so that boot and foot rhyme (McCafferty 2001). Another is the variation between [ʌ] and [ʊ] that appears to have sociophonetic significance. Consider the vowel plots in Figure 3 of the F1 vs F2 formant frequencies found in the word list elicitations of the 14-year-old male speaker.
Figure 3. Formant frequency plots for the vowels of the word list elicited from the 14-year-old boy.
Notice the quality merging of vowels in GOOSE words (that would have /uː/ in RP) and FOOT words (that would have /ʊ/ in RP). Notice also the variation between [ʌ]-like pronunciations of FOOT words, shown on the vowel chart as ʊ (the well-documented case of the word pull (e.g. McCafferty 2001)), and the [ʊ]-like pronunciation of STRUT words, shown on the vowel chart as ʌ, such as comfort. So then, as explained by McCafferty (2001:158), words of the FOOT class are variably realized with the GOOSE vowel [ʉ] and the STRUT vowel [ʌ].
McCafferty (2001:157–166) deals with the variation between [ʌ] and [ʊ] at length. An [ʌ]-like pronunciation of words like pull has been found in both rural and urban speech. This feature is very common in the vernacular, but is also stigmatised by the upwardly aspiring. In fact the 14-year-old informant was mocked by his listening aunt when he read pull as [pʌl]. This may be why he adjusted his pronunciation in the next occurrence of the word to something like [pul].
Conclusion
This study shows that the speech found in rural southwest Tyrone demonstrates many of the features found in previous studies. In particular, Aitken's law appears to apply (although a follow-up study will hopefully fill in gaps in the data set to further test the relationship between Aitken's law and fortis clipping in ERST). The GOOSE-FOOT merger and the GOOSE-STRUT variation are features found both in ERST and in accounts of the speech of other communities where Mid-Ulster English is spoken.
Acknowledgements
This work would have been impossible without
the cooperation and generous interest of my in-
formants in Southwest Tyrone.
References
Aitken (1981) The Scottish vowel length rule.
In So meny people longages and tonges:
Philological essays in Scots and mediaeval
English presented to Angus McIntosh, M.
Benskin and M. L. Samuels (eds), 131- 157.
Edinburgh: Middle English Dialect Project.
Harris, J. (1985). Phonological variation and
change: studies in Hiberno-English. Cam-
bridge: Cambridge Univ. Press.
McCafferty, K. (2001). Ethnicity and Language
Change. English in (London)Derry, North-
ern Ireland. Philadelphia, PA, USA: John
Benjamins Publishing Company.
Wells, J. (1982). Accents of English Vol. II
Cambridge: Cambridge Univ. Press.
The beginnings of a database for historical sound
change
Olle Engstrand¹, Pétur Helgason² and Mikael Parkvall¹
¹Department of Linguistics, Stockholm University
²Department of Linguistics and Philology, Uppsala University
Abstract
We report a preliminary version of a database
from which examples of historical sound
change can be retrieved and analyzed. To date,
the database contains about 1,000 examples of
regular sound changes from a variety of lan-
guage families. As exemplified in the text,
searches can be made based on IPA symbols,
articulatory features, segmental or prosodic
context, or type of change. Ultimately, the da-
tabase is meant to provide an adequately large
sample of areally and genetically balanced in-
formation on historical sound changes that
tend to take place in the world's languages. It
is also meant as a research tool in the quest for
diachronic explanations of genetic and areal
biases in synchronic typology.
Background and purpose
From its early beginnings in the 18th century,
diachronic phonology has had considerable
success in documenting and reconstructing his-
torical sound changes in various languages and
language families, as well as in formulating
general theories for how and why speech
sounds change over time (see, e.g., Lehmann
1962, Anttila 1989, Lass 1997). However, the
resulting data are scattered, heterogeneous and
often hard to interpret. If the information could
be made available in a searchable and compa-
rable form, we would have access to a valuable
basis for typologically oriented studies of his-
torical sound change. With this objective in
mind, we have made a preliminary survey of
sound changes as observed in a number of lan-
guages and language families.
Method
So far, we have entered about one thousand ex-
amples of regular consonant changes into a Mi-
crosoft Excel data sheet. The rows (database
records) correspond to individual sound
changes, and the columns (fields) represent the
parameters of each change. For each change,
'input sound' (the changing sound) and 'output sound' (the sound resulting from the change) are represented in terms of feature values and IPA symbols. Additional parameters
include right and left context, prosodic infor-
mation, type of change (such as elision, epen-
thesis, devoicing or assimilation) and, where
information is available, relative chronology.
Examples, questions and comments are noted
in separate fields.
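The record-and-field layout can be illustrated with a toy version of the database and a generic field search. The records below are modelled on the Vulgar Latin examples discussed in the next paragraph, but the field names and values are our own simplification of the Excel columns, not the actual sheet:

```python
# Each record is one sound change; fields mirror the column types
# described in the text (input, output, type, context, language).
changes = [
    {"from": "b", "to": "β", "type": "lenition",
     "left": "V", "right": "V", "language": "Italian"},
    {"from": "β", "to": "", "type": "elision",
     "left": "V", "right": "V", "language": "Italian"},
    {"from": "k", "to": "tʃ", "type": "palatalization",
     "left": "V", "right": "i", "language": "Italian"},
]

def search(records, **criteria):
    """Return all records whose fields match every given criterion,
    e.g. search(changes, left='V', right='V') for intervocalic changes."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]
```

A query like `search(changes, left="V", right="V")` then retrieves both intervocalic changes, matching the kind of context-based search the database is meant to support.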
An illustration is given in Table 1. The table
illustrates a small subset of database records
(rows) and their evaluations in terms of field
parameters (columns). All sound changes in the
table were sampled from the phonological de-
velopment of Vulgar Latin into Modern Italian
(Grandgent 1908, 1927, Bertinetto & Lopor-
caro 2005). It should be noted, however, that
our preliminary corpus contains more non-
European than European languages.
In the left columns, the labels "C old man", "C old place", "C old voice" and "C old sec" refer to properties of the input, i.e., "old", sounds. "C" stands for consonant, and "man", "place", "voice" and "sec" represent manner of articulation, place of articulation, voicing and secondary articulation, respectively. Further to the right, the new (output) sounds are specified along these same dimensions. The "From" and "To" columns show the respective input and output sounds in IPA notation. Segmental context is specified in the two rightmost columns (context before and context after the changing segment, respectively); "V" means vowel, and the "#" symbol represents a word boundary. For example, the first row exemplifies a change from [b] (a voiced bilabial stop, in this case without any secondary articulation; thus, the latter variable is evaluated as 0 = zero) to [β] (a voiced bilabial fricative, again with no secondary articulation). The two rightmost columns indicate that this change has occurred in intervocalic position. The second row illustrates an elision: a [β] is dropped in intervocalic position. As this leaves no remaining consonant, the articulatory variables are not applicable (hence "n.a.").
Rows 5 and 7 suggest that a certain sound
change may occur independently at different
times or in different dialect areas, rows 2 and 4
show that a given input sound may be differ-
ently affected depending on context, and rows 7
and 8 demonstrate that identical outputs can
originate from different sources. The remaining
rows can be read analogously.
Searches are performed based on transcrip-
tions or feature profiles using the program's standard filter functions. For example, searching /r/ as the input sound will return all historical developments from /r/ that are represented
in the database. For another example, a search
for [nasal] & [consonant] as the right-hand con-
text will yield all changes that have taken place
before nasal consonants. Searches may be con-
strained by combining several criteria; assume,
for example, that the input sound has the fea-
tures [velar] & [stop], that its right-hand con-
text is specified in terms of [vowel] & [front],
and that the output sound is [palatal] or [coronal] & [obstruent]. These conditions will be met by various instances of an intuitively ubiquitous historical sound change, namely velar palatalization in front-vowel context (Guion 1998).
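The filtering logic described above can be sketched in a few lines of Python. The records and feature labels here are toy illustrations, and the actual database relies on Excel's built-in filter functions rather than code:

```python
# Two toy records: each stores input/output IPA plus feature sets.
records = [
    {"from": "k", "in_feats": {"velar", "stop"},
     "to": "tʃ", "out_feats": {"postalv", "affric", "obstruent"},
     "ctx_after": {"vowel", "front"}},
    {"from": "s", "in_feats": {"coronal", "fric"},
     "to": "ʃ", "out_feats": {"postalv", "fric", "obstruent"},
     "ctx_after": {"vowel"}},
]

def search(records, in_feats=set(), out_feats=set(), ctx_after=set()):
    """Return records whose fields contain all the requested features.

    Each criterion is a feature set; an empty set matches anything,
    so criteria can be freely combined, as in the example above.
    """
    return [r for r in records
            if in_feats <= r["in_feats"]          # subset test
            and out_feats <= r["out_feats"]
            and ctx_after <= r["ctx_after"]]

# Velar palatalization: [velar] & [stop] input before [vowel] & [front].
hits = search(records, in_feats={"velar", "stop"},
              ctx_after={"vowel", "front"})
```

The subset test (`<=`) mirrors the conjunction of filter criteria: a record matches only if it carries every requested feature in every specified field.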
Table 1. A subset of database records and some field values for consonants.
Entry | C old man | C old place | C old voice | C old sec | From | To  | C new man | C new place | C new voice | C new sec | Ctx before | Ctx after
1     | stop      | bilab       | vd          | 0         | b    | β   | fric      | bilab       | vd          | 0         | V          | V
2     | fric      | bilab       | vd          | 0         | β    | 0   | n.a.      | n.a.        | n.a.        | n.a.      | V          | V
3     | glide     | labvel      | vd          | 0         | w    | gʷ  | stop      | vel         | vd          | labzd     | #          | V
4     | fric      | bilab       | vd          | 0         | β    | v   | fric      | labdent     | vd          | 0         | #          | V
5     | fric      | cor         | vl          | 0         | s    | ʃ   | fric      | postalv     | vl          | 0         | #          | V
6     | fric      | cor         | vl          | 0         | s    | ts  | affric    | cor         | vl          | 0         | #          | V
7     | fric      | cor         | vl          | 0         | s    | ʃ   | fric      | postalv     | vl          | 0         | #          | V
8     | affric    | postalv     | vl          | 0         | tʃ   | ʃ   | fric      | postalv     | vl          | 0         | #          | V
9     | glide     | pal         | vd          | 0         | j    | dʒ  | affric    | postalv     | vd          | 0         | #          | V
10    | affric    | postalv     | vd          | 0         | dʒ   | ʒ   | fric      | postalv     | vd          | 0         | #          | V
Looking for patterns: an example
The results of an exemplifying test run are
given in Table 2. The table, which is based on
993 recorded cases, illustrates the relative inci-
dence of the indicated kinds of historical sound
change. Types of change are shown in the left
column, and frequencies of occurrence pertain-
ing to the respective types are shown as per-
centages in the right column. It can be seen, for
example, that 18 percent of all cases are eli-
sions, that 10 percent represent a development
from stop to fricative (i.e., fricativization), and
that debuccalization, delabialization and depalatalization account for 4, 6 and 3 percent of the changes, respectively. (The term "debuccalization" refers to a process whereby a consonant's supraglottal place of articulation is replaced by [h] or [ʔ].) If all these types are
regarded as lenitions, 39 percent of all sound
changes in the database are lenitions (this fig-
ure is slightly lower than the sum of the per-
centages because a change may occasionally
comprise more than one of these components).
It should be pointed out, however, that the true
proportion of lenitions is probably much
higher, because assimilations were not counted
in this test run (the reason being that all neces-
sary contextual information is not yet in place).
Even though these figures are not necessarily
representative of the languages of the world,
observations of this kind may enhance our un-
derstanding of the driving forces behind proc-
esses of historical sound change.
Table 2. Types of lenitions given as percentages of 993 recorded changes. The bottom row summarizes the total percentage of lenitions in terms of these types.

Type of change      Percent (N=993)
Elision                  18
Fricativization          10
Debuccalization           4
Delabialization           6
Depalatalization          3
Total lenitions          39
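The point that the lenition total can fall below the sum of the per-type percentages, because one change may carry several lenition labels, can be illustrated with toy data. The sets and counts below are invented for the example and are not the actual database figures:

```python
# Each change is tagged with the set of lenition types it instantiates.
changes = [
    {"elision"},
    {"fricativization"},
    {"debuccalization", "delabialization"},  # one change, two lenition types
    set(),                                   # a non-lenition change
]
lenition_types = {"elision", "fricativization", "debuccalization",
                  "delabialization", "depalatalization"}

# Summing per-type percentages counts the third change twice...
per_type_sum = sum(
    sum(1 for c in changes if t in c) for t in lenition_types
) / len(changes) * 100

# ...whereas the union counts each leniting change exactly once.
union_pct = sum(
    1 for c in changes if c & lenition_types
) / len(changes) * 100
```

Here the per-type percentages sum to 100, while only 75 percent of the changes are lenitions, because one change contributes to two type counts.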
Future developments and applications
One long-term goal of the present project is to
create a database that will be fairly representa-
tive of the historical sound changes that tend to
take place in the languages of the world. This
will require a large number of records based on
a genetically and areally balanced selection of
languages and language families. Clearly, the
interpretation of many of these sources will re-
quire expert advice and assistance.
As a phonetic contribution, a database for
diachronic phonology will help to identify the
preconditions of historical sound change that
are hidden in universal constraints on speech
production, perception and development (e.g.,
Hombert et al. 1979, Greenlee & Ohala 1980,
Locke 1983, Svantesson 1983, Lindblom 1990,
Ohala 1993, Willerman 1994, Lindblom et al.
1995, Engstrand & Williams 1996, Helgason
2002). This perspective also has a potential for
guiding experimental quests for parallels be-
tween historical sound changes and on-line
speech communication phenomena such as
coarticulation, reduction and restoration (Ohala
1989, Engstrand & Lacerda 1996, Engstrand et
al. 2007). In addition, given a full-size, repre-
sentative database, it will be possible to track
down developmental tendencies in genetically
or areally defined language groups and to com-
pare the diachronic data with the corresponding
typological patterns as observed in synchronic
databases such as UPSID (Maddieson 1984).
Thus, diachronic data may help to account for
patterns of areal typology that are not readily
accessible to synchronic explanation, e.g.,
"crazy rules" or "telescope rules" such as the alternation p > s /_i in Bantu (Bach & Harms
1972, Hyman 1975, Ohala 1983, Blevins 2004)
as well as areal correlations among complex
segments such as prenasalized, implosive and
doubly articulated stops (Ladefoged 1964,
Sherman 1975, Maddieson 1984, Lindblom &
Maddieson 1988, Ladefoged & Maddieson
1996, Engstrand 1997, Janson & Engstrand
2001). Many such marked phonologies may
become quite transparent in a historical per-
spective.
Acknowledgements
This project is being carried out in collabora-
tion with Juliette Blevins, Max Planck Institute
for Evolutionary Anthropology, Leipzig. Many
thanks to Pier Marco Bertinetto, Scuola Nor-
male Superiore, Pisa, for valuable help and ad-
vice. This work has been supported in part by
Fondazione Famiglia Rausing through a grant
to the first author.
References
Anttila R. (1989) Historical and comparative
linguistics. Philadelphia: John Benjamins.
Bach E. & Harms R.T. (1972) How do lan-
guages get crazy rules? In R.P. Stockwell & R.K.S. Macaulay (eds.), Linguistic change
and generative theory, 1-21. Bloomington:
Indiana University Press.
Bertinetto P.M. & Loporcaro M. (2005) The
sound pattern of Standard Italian, as com-
pared with the varieties spoken in Florence,
Milan and Rome. Journal of the International Phonetic Association 35, 131-151.
Blevins J. (2004) Evolutionary phonology: The
emergence of sound patterns. Cambridge:
Cambridge University Press.
Engstrand O. (1997) Areal biases in stop para-
digms. PHONUM 4, 187-190.
Engstrand O. & Lacerda F. (1996) Lenition of
stop consonants in conversational speech:
evidence from Swedish (with a sideview on stops in the world's languages). Arbeitsberichte, Institut für Phonetik und digitale Sprachverarbeitung, Universität Kiel (AIPUK) 31, 31-41.
Engstrand O. & Williams K. (1996) VOT in
stop inventories and in young children's vocalizations: preliminary analyses. Proceedings of FONETIK 1996, Nässlingen, May
1996. Speech, Music and Hearing, Royal
Institute of Technology, Quarterly Progress
and Status Report 2/1996, 97-99.
Engstrand O., Frid J. & Lindblom B. (2007) A perceptual bridge between coronal and dorsal /r/. In P. Beddor, M. Ohala & M.-J. Solé (eds.), Experimental Approaches to Phonology. Oxford University Press, 175-191.
Grandgent C.H. (1908) An introduction to Vul-
gar Latin. Boston: Heath & Co.
Grandgent C.H. (1927) From Latin to Italian.
An historical outline of the phonology and
morphology of the Italian language. Cam-
bridge: Harvard University Press.
Greenlee M. & Ohala J.J. (1980) Phonetically motivated parallels between child phonology and historical sound change. Language
Sciences 2, 283-308.
Guion S. (1998) The role of perception in the
sound change of velar palatalization. Pho-
netica 55, 18-52.
Helgason P. (2002) Preaspiration in the Nordic
languages: Synchronic and diachronic as-
pects. PhD diss., Stockholm University.
Hombert J.-M., Ohala J.J. & Ewan W.G. (1979)
Phonetic explanations for the development
of tones. Language 55, 37-58.
Hyman L.M. (1975) Phonology: theory and
analysis. New York: Holt, Rinehart and
Winston.
Janson T. & Engstrand O. (2001) Some unusual sounds in Changana. Proceedings of FONETIK 2001, Örenäs, May 30-June 1,
2001. Working Papers, Department of Lin-
guistics, Lund University 49, 74-77.
Ladefoged P. (1964) A phonetic study of West
African languages. Cambridge: Cambridge
University Press.
Ladefoged P. & Maddieson I. (1996) The
sounds of the world's languages. Oxford:
Blackwell.
Lass R. (1997) Historical linguistics and lan-
guage change. Cambridge: Cambridge Uni-
versity Press.
Lehmann W.P. (1962) Historical linguistics:
An introduction. New York: Holt, Rinehart
and Winston.
Lindblom B. (1990) Explaining phonetic varia-
tion: A sketch of the H&H theory. In W.J. Hardcastle & A. Marchal (eds.), Speech production and speech modeling. Dordrecht: Kluwer, 403-439.
Lindblom B. & Maddieson I. (1988) Phonetic
universals in consonant systems. In L.M.
Hyman & C.N. Li (eds.), Language, speech,
and mind. New York: Routledge, 62-78.
Lindblom B., Guion S., Hura S., Moon S.-J . &
Willerman R. (1995) Is sound change adap-
tive? Rivista di Linguistica 7, 5-37.
Locke J.L. (1983) Phonological acquisition and
change. New York: Academic Press.
Maddieson I. (1984) Patterns of sounds. Cam-
bridge: Cambridge University Press.
Ohala J.J. (1983) The origin of sound patterns
in vocal tract constraints. In P.F.
MacNeilage (ed.), The Production of
Speech, 189-216. New York: Springer-
Verlag.
Ohala J.J. (1989) Sound change is drawn from a pool of synchronic variation. In L.E. Breivik & E.H. Jahr (eds.), Language
Change: Contributions to the study of its
causes. [Series: Trends in Linguistics, Stud-
ies and Monographs No. 43]. Berlin: Mou-
ton de Gruyter, 173-198.
Ohala J.J. (1993) The phonetics of sound change. In C. Jones (ed.), Historical Linguistics: Problems and Perspectives. London: Longman, 237-278.
Sherman D. (1975) Stop and fricative systems:
a discussion of paradigmatic gaps and the
question of language sampling. Stanford
University Phonology Archiving Project,
Working Papers on Language Universals
17, 1-31.
Svantesson J.-O. (1983) Kammu phonology
and morphology. Lund: Gleerup.
Willerman R. (1994) The phonetics of pro-
nouns: articulatory bases of markedness.
PhD diss., University of Texas, Austin.
Author index
Ambrazaitis, Gilbert 81
Ananthakrishnan, G 9
Barry, William J. 25
Beskow, Jonas 33, 61, 85
Blomberg, Mats 37
Bruce, Gösta 61, 85
Cunningham, Una 97
Diehl, Randy 5
Edlund, Jens 17, 29, 33
Eir Cortes, Elisabet 1
Engstrand, Olle 101
Engwall, Olov 57
Enflo, Laura 61, 85
Elenius, Daniel 37
Frid, Johan 41
Granström, Björn 33, 61, 85
Gustafson, Joakim 17, 33, 69
Heldner, Mattias 29
Helgason, Pétur 65
Hincks, Rebecca 21
House, David 89
Jonsson, Oskar 33
Karlsson, Anastasia 89
Koreman, Jacques 25
Laskowski, Kornel 29
Lindblom, Björn 1, 5
Lindseth, Marte Kristine 25
McAllister, Anita 77
Neiberg, Daniel 9
Park, Sang-Hoon 5
Parkvall, Mikael 65
Pind, Jörgen L. 49
Riad, Tomas 93
Salvi, Giampiero 5
Schötz, Susanne 61, 85
Segerup, My 93
Seppänen, Tapio 53
Skantze, Gabriel 33
Stenberg, Michal 13
Strangert, Eva 69
Tayanin, Damrong 89
Toivanen, Juhani 53, 73
Tronnier, Mechtild 77
van Dommelen, Wim A. 45
Wik, Preben 57
Väyrynen, Eero 53
Ylitalo, Riika 65