Andrew Butcher
Abstract
The field of forensic phonetics has developed over the last 20 years or so and embraces a number
of areas involving analysis of the recorded human voice. The area in which expert opinion is
most frequently sought is that of speaker identification – the question of whether two or more
recordings of speech (from suspect and perpetrator) are from the same speaker. Automated
analysis (in which Australia is a world leader) is only possible where recording conditions are
identical. In the most frequently encountered real-world forensic situation, comparison is
required between a police interview recording and recordings made via telephone intercepts or
listening devices. This necessitates a complex procedure, involving auditory and acoustic
comparison of both linguistic and non-linguistic features of the speech samples in order to build
up a profile of the speaker. The most commonly used measures are average fundamental
frequency and the first and second formant frequencies of vowels. Much work is still needed to
develop appropriate statistical procedures for the evaluation of phonetic evidence. This means
estimating the probability of finding the observed differences between samples from the same
speaker and the probability of finding those same differences between samples from two
different speakers. Thus there needs to be an acceptance that the outcome will not be an absolute
identification or exclusion of the suspect. By itself, your voice is not a complete giveaway.
challenge the prosecution’s version of what was actually said in the course of a recorded
conversation. Forensic phoneticians may be asked to prepare a report on the quality of the
recording and the intelligibility of the speech. They may also be asked to prepare an ‘objective’
transcript of the recording.
1.3 Tape authentication. Occasionally a defendant (or a civil litigant) may have cause to question
whether an audio recording has been tampered with in some way. Usually the claim is that
certain sections have been excised or perhaps transposed. It is not generally within the
competence of a phonetician to give an opinion as to the physical condition of a tape, but there
may be evidence within the acoustic signal (‘pops’ or abrupt changes in either the signal itself or
the background noise) which would be indicative of electronic editing. However, currently
available software makes ‘seamless’ editing comparatively easy, and a phonetician may be
needed to give an opinion on the only remaining evidence of any tampering – linguistic evidence
in the form of unnatural changes in rhythm, tempo or intonation.
1.4 Voice line-ups. The practice of confronting witnesses of a crime with a tape recorded ‘voice
line-up’, where the voice of a suspect is included amongst a series of ‘foils’, may be used to
obtain evidence of identification in cases where, in the course of committing a crime, an unseen
or masked perpetrator has spoken in the presence of the witnesses. This recording is played to
the witness(es) and they are asked to state whether they can identify any of the voices as that of
the perpetrator. In order to be entirely fair to the suspect, there are a number of criteria which
need to be observed (Broeders & Rietveld 1995; Hollien, Huntley, Künzel & Hollien P 1995). As
with visual identification parades, a general principle of fairness in the conducting of voice
line-ups is that there should be no feature of any of the voices or the recordings which would
cause non-witnesses to pick out a particular speaker (whether suspect or foil) as being different
from the rest. A phonetician may be consulted on aspects of the construction of the tape and the
administration of the confrontation.
Forensic Phonetics Butcher
notably in the USA. This methodology, which involved the visual inspection and impressionistic
comparison of sound spectrograms, was regarded sceptically by the scientific community at the
time, and has since been entirely discredited (Hollien 1990, 2002; Gruba & Poza 1995). The
term “Voiceprint” suggests that the technique is analogous to forensic techniques such as
fingerprinting or DNA analysis. There are a number of reasons why this is an inappropriate
analogy. Firstly, there is no single feature of the voice which is unique to every speaker. Unlike
the vanishingly small possibility in the case of fingerprints or DNA molecules, it is quite
possible for two speakers to be, for all practical purposes, identical in some respect. Secondly,
most (if not all) of the features of the voice which are measurable in recordings of the quality
typically encountered in the forensic context are capable of being consciously changed by the
speaker. These include voice pitch, aspects of voice quality, consonantal articulation, and vowel
quality. At present it is not impossible for a skilled mimic to defeat the forensic voice
identification procedure. Thirdly, for most of the voice features, we do not have sufficient data
on the normal population to know what the chances are of two speakers being similar or identical
with respect to that feature. Finally, acoustic parameters vary as a consequence of differences in
recording conditions as well as of differences in the voice itself. Australia leads the world in the
technology of automatic speaker recognition (in 2001 a team from the RCSAVT Speech
Research Lab at Queensland University of Technology won two of the categories for single
speaker detection tasks in the National Institute of Standards & Technology’s benchmark tests
on speaker recognition), but automatic speaker recognition is not yet able to separate out
variation due to speaker differences from variation due to recording conditions (and it is doubtful
whether it will ever be able to). Thus automatic speaker recognition techniques are of limited use
in the typical forensic situation, where a voice recorded over the telephone or via a listening
device is to be compared with a voice recorded in a police interview room. The intervention of a
phonetically and linguistically qualified human operator is required. The main components of the
procedure are an auditory analysis and an acoustic analysis, each of which in turn has a number
of component parts. Voice ID is therefore more appropriately compared with a technique such as
a ‘photo-fit’ type of procedure, where a number of features are considered as part of an overall
profile.
This part of the analysis involves careful and repeated listening by the expert, noting features of
the voices in question under four basic headings. Firstly, voice quality features are ascertained.
This means describing ‘voice’ in the technical sense – i.e. the sound made by the vibration of the
vocal folds – and ignoring for the moment any variations contributed by the resonances of the
throat, mouth and nasal passages above. It can be done using one of a number of descriptive
frameworks (e.g. Isshiki & Takeuchi 1970; Laver 1980; Wendler, Rauhut & Krüger 1986; Oates
& Russell 1998), whereby aspects of the voice can be quantified according to parameters such as
‘roughness’, ‘strain’, ‘creakiness’, ‘breathiness’ and so on – terms which are meaningful to other
phoneticians and speech scientists and which describe in as accurate and objective a way as
possible the auditory impressions of the listener. Secondly, the investigator attends to the non-
linguistic characteristics of the speech which are not produced by the larynx. This means
listening to the effects of the long-term setting of the throat, the tongue and lips and the
resonances of the nasal passages and sinuses. This is known as the articulatory setting, and here
too, established descriptive frameworks are available (Laver 1980; Esling 1994) which rate the
voice according to such parameters as ‘hypernasality’, ‘pharyngealisation’, ‘labialisation’, as
well as vertical position of the larynx. The third set of parameters relates to aspects of (mainly
vowel) articulation which provide clues to the speaker’s geographical and social background. In
long-established linguistic communities such as in the United Kingdom and Europe, this part of
the analysis can provide very useful information. In a recently established community such as
(non-Aboriginal) Australia, the information which can be gleaned is usually quite scanty.
Australian English accents are traditionally classified on a three-point scale as being ‘Broad’,
‘General’ or ‘Cultivated’ (Mitchell & Delbridge 1965), but there are very few features which
enable us to pinpoint the speaker’s geographical origins with any accuracy. One or two
pronunciations are peculiar to Queensland and another one or two distinguish speakers with a
South Australian background. A more recent phenomenon is the “pan-ethnic” accent (sometimes
known as “wogspeak”) which has developed among second- and subsequent-generation
Australians of non-English-speaking background (Warren 1999). The final component of the
auditory analysis is the identification of any idiosyncratic pronunciation features which may be
present. The more commonly occurring idiosyncrasies involve the articulation of consonants,
and include various types of ‘lisp’, the labialising of ‘r’ (‘rabbit’ becomes something like
‘wabbit’) and the pronunciation of ‘th’ as ‘v’. Apart from this, speakers may exhibit various
kinds of dysfluency, including stuttering, ‘cluttering’ and slurring of words.
Figure 1: Waveform (above) and pitch contour (below) of the utterance “We went to Woolloomooloo”
Figure 1 shows a waveform and pitch contour for an Australian English sentence. The waveform
at the top represents the tiny variations in air pressure caused by the transmission of the sound
waves. The bottom trace shows the variation in frequency of those vibrations over time: the
fundamental frequency. Each speaker has a particular range of fundamental frequency which
s/he habitually uses and within which s/he feels most comfortable and this is an important
measure for forensic purposes, because it is one of the few measures for which we know the
distribution amongst the adult population at large. The average speaking fundamental frequency
for an adult Caucasian male is 113 Hz, and 50% of the male population lie somewhere between
100 and 130 Hz in spontaneous speech (Künzel 1989). The corresponding average for females is
225 Hz. Figure 2 shows how this measure may be used in building up a voice profile. In this case
the voice of a person issuing a ransom demand over the telephone is compared with the voices of
two suspects (Butcher & Moody 1999). Clearly the fundamental frequency of suspect 1 is much
closer to that of the perpetrator than is the fundamental frequency of suspect 2. Furthermore both
the perpetrator and suspect 1 differ markedly from the population mean and in the same
direction.
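
The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and all the per-recording values are hypothetical, and only the 113 Hz population mean comes from the text. It summarises each speaker's mean fundamental frequency, reports the offset from the population mean, and picks the suspect whose mean lies nearest the perpetrator's.

```python
from statistics import mean, stdev

POPULATION_MEAN_F0_HZ = 113.0  # adult male average quoted in the text


def f0_profile(f0_values_hz):
    """Return (mean, standard deviation) of per-recording mean F0 values in Hz."""
    return mean(f0_values_hz), stdev(f0_values_hz)


def closer_suspect(perpetrator_hz, suspects_hz):
    """Return the name of the suspect whose mean F0 is nearest the perpetrator's."""
    target = mean(perpetrator_hz)
    return min(suspects_hz, key=lambda name: abs(mean(suspects_hz[name]) - target))


# Hypothetical per-recording mean F0 values (Hz), invented for illustration only.
perpetrator = [168.0, 172.0, 165.0]
suspects = {
    "suspect 1": [160.0, 166.0, 163.0],
    "suspect 2": [118.0, 112.0, 121.0],
}

perp_mean, perp_sd = f0_profile(perpetrator)
offset = perp_mean - POPULATION_MEAN_F0_HZ
print(f"perpetrator: mean {perp_mean:.1f} Hz, SD {perp_sd:.1f} Hz, offset {offset:+.1f} Hz")
print("closest:", closer_suspect(perpetrator, suspects))
```

A nearest-mean comparison of this kind is only one element of a voice profile; it says nothing by itself about how often two different speakers would be equally close.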
[Figure 2 plot: vertical axis shows mean fundamental frequency (Hz), scaled from 60 to 220.]
Figure 2: Graph of mean fundamental frequency of three speakers in a number of recordings. The vertical lines
represent one standard deviation either side of the mean. The dashed line represents the mean for the
adult male population.
which represent the two recording conditions most commonly offered for comparison in the
forensic situation. Clearly this measure can only be used in the limited number of situations
where the conditions under which recordings have been made can be assumed to be similar.
[Figure 4 spectrogram: formants F1, F2 and F3 labelled on each of the three vowels; time axis from 0 to 1.775 s.]
Figure 4: A sound spectrogram of the words ‘head, had, hard’, spoken by an adult male. The dark horizontal bands
(F1, F2, F3) in the vowels represent areas of higher energy known as FORMANTS.
A useful way of summarising vowel formant frequency data from a given speaker is to plot the
mean values of the first formant against the mean values of the second formant for all the
vowels. This provides a characteristic pattern or ‘vowel space’ for the speaker, as shown in
Figure 5, which is based on data measured from the voice of a murder suspect during interview.
In this figure the first formant frequency is shown on the vertical axis and the second formant
frequency on the horizontal axis. The origins of the axes are placed in the top right hand corner,
so that the positions of the points on the chart relate approximately to the position of the tongue
and jaw: vowels pronounced with a forward position of the tongue and spread lips appear on the
left and those with a retracted tongue and rounded lips appear on the right. Vowels with a raised
tongue and closed jaw are at the top and vowels with a lower tongue and open jaw at the bottom.
The individual letters represent a point positioned at the intersection of the means of the first and
second formant frequencies of the vowel in question. The ellipses represent a distance of two
standard deviations around the mean for that vowel, i.e. the area which would include 95% of the
speaker’s vowels of that type.
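
A vowel-space summary of this kind might be computed as in the sketch below. The function name, the data layout and the formant values are all hypothetical, and the ellipses are assumed to be axis-aligned, with semi-axes of two standard deviations in F1 and in F2; the text does not say whether the original ellipses were oriented along the data.

```python
from collections import defaultdict
from statistics import mean, stdev


def vowel_space(tokens):
    """Summarise (vowel, F1, F2) tokens as per-vowel means and 2-SD semi-axes (Hz)."""
    by_vowel = defaultdict(list)
    for vowel, f1, f2 in tokens:
        by_vowel[vowel].append((f1, f2))

    summary = {}
    for vowel, pairs in by_vowel.items():
        f1_values = [f1 for f1, _ in pairs]
        f2_values = [f2 for _, f2 in pairs]
        summary[vowel] = {
            "mean_f1": mean(f1_values),
            "mean_f2": mean(f2_values),
            # Assumption: axis-aligned ellipse, two standard deviations per axis.
            "semi_axis_f1": 2 * stdev(f1_values),
            "semi_axis_f2": 2 * stdev(f2_values),
        }
    return summary


# Hypothetical formant measurements in Hz, invented for illustration only.
tokens = [
    ("i", 300.0, 2200.0), ("i", 320.0, 2300.0), ("i", 310.0, 2250.0),
    ("a", 700.0, 1300.0), ("a", 740.0, 1250.0), ("a", 720.0, 1350.0),
]

space = vowel_space(tokens)
for vowel, stats in space.items():
    print(vowel, stats)
```

To reproduce the layout described above, with the origin in the top right-hand corner, a plot of these means would put F2 on the horizontal axis and F1 on the vertical axis and reverse the direction of both.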
Figure 6: Comparison of short vowels from a suspect in a police interview recording with corresponding vowels of
a speaker recorded via a listening device. The ellipses are the same as in Figure 5 – i.e. they represent
two standard deviations around the means for the suspect’s voice. The phonetic symbols represent
individual short vowels from the unknown speaker.
In Figure 6 the same ellipses are superimposed on a set of data points representing the formant
frequencies of vowels from an unknown speaker recorded via a listening device. The degree of
overlap between the two speakers can be roughly quantified by calculating the proportion of
vowel points from the unknown speaker which fall within the appropriate ellipse of the suspect
speaker. In this particular diagram, only 50% of the unknown speaker’s vowels fall within the
corresponding ellipse of the suspect speaker. Based on this data alone, there would have to be
considerable doubt that the speakers are the same.
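
The rough overlap measure described here can be sketched as follows. The names, the ellipse parameters and the vowel tokens are hypothetical, and the ellipses are again assumed to be axis-aligned, each given as a centre (mean F1, mean F2) and two semi-axes in Hz.

```python
def inside_ellipse(f1, f2, centre_f1, centre_f2, semi_f1, semi_f2):
    """True if the point (f1, f2) lies on or inside an axis-aligned ellipse."""
    return ((f1 - centre_f1) / semi_f1) ** 2 + ((f2 - centre_f2) / semi_f2) ** 2 <= 1.0


def overlap_proportion(unknown_tokens, suspect_ellipses):
    """Fraction of the unknown speaker's tokens inside the suspect's ellipse
    for the same vowel. Vowels with no suspect ellipse are skipped."""
    hits = 0
    counted = 0
    for vowel, f1, f2 in unknown_tokens:
        if vowel not in suspect_ellipses:
            continue
        counted += 1
        if inside_ellipse(f1, f2, *suspect_ellipses[vowel]):
            hits += 1
    if counted == 0:
        raise ValueError("no vowel tokens could be compared")
    return hits / counted


# Hypothetical suspect ellipses: (mean F1, mean F2, semi-axis F1, semi-axis F2) in Hz.
suspect_ellipses = {
    "i": (310.0, 2250.0, 20.0, 100.0),
    "a": (720.0, 1300.0, 40.0, 100.0),
}

# Hypothetical tokens from the unknown speaker: (vowel, F1, F2) in Hz.
unknown_tokens = [
    ("i", 315.0, 2280.0),
    ("i", 350.0, 2250.0),
    ("a", 730.0, 1320.0),
    ("a", 800.0, 1100.0),
]

proportion = overlap_proportion(unknown_tokens, suspect_ellipses)
print(f"{proportion:.0%} of the unknown speaker's vowels fall inside the suspect's ellipses")
```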
Data from a different case are shown in Figures 7 and 8. In these plots the mean frequencies of
the vowel sets are compared. In Figure 7 the combined mean values from a perpetrator’s vowels
in a number of phone calls are compared with the values for the equivalent vowels spoken by 20
adult male speakers of General Australian English from the Australian National Database of
Spoken Language (Millar, Vonwiller, Harrington & Dermody 1994). The two patterns look quite
different, and the overall mean difference between the values of the perpetrator and those of
this sample of the general population is 12.2%. Figure 8 shows the same set of perpetrator
vowels compared with those of a suspect. The degree of similarity between the two patterns
appears much greater, and indeed the mean difference between the values for the perpetrator and
those for the suspect is 3.3%. Thus the formant frequencies of the perpetrator are considerably
closer to those of the suspect than they are to those of the general population. Experience
suggests that a variation of 5% or less is of the order expected within a single speaker.
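
The text does not state how the mean percentage difference was calculated. The sketch below assumes one plausible definition: the absolute difference between corresponding formant means, expressed as a percentage of the reference value and averaged over F1 and F2 for every vowel present in both sets. The function name and all the data are hypothetical; only the 5% figure comes from the text.

```python
WITHIN_SPEAKER_LIMIT_PERCENT = 5.0  # order of variation the text expects within one speaker


def mean_percentage_difference(sample, reference):
    """Mean absolute percentage difference between two sets of vowel formant means.

    Both arguments map a vowel label to a (mean F1, mean F2) pair in Hz.
    Assumed definition: |sample - reference| / reference * 100, averaged over
    both formants of every vowel present in both sets.
    """
    differences = []
    for vowel in sample.keys() & reference.keys():
        for sample_hz, reference_hz in zip(sample[vowel], reference[vowel]):
            differences.append(abs(sample_hz - reference_hz) / reference_hz * 100.0)
    if not differences:
        raise ValueError("the two sets share no vowels")
    return sum(differences) / len(differences)


# Hypothetical formant means in Hz, invented for illustration only.
perpetrator = {"i": (300.0, 2200.0), "a": (700.0, 1300.0)}
suspect = {"i": (310.0, 2250.0), "a": (720.0, 1270.0)}

difference = mean_percentage_difference(perpetrator, suspect)
print(f"mean difference: {difference:.1f}%")
print("within single-speaker range:", difference <= WITHIN_SPEAKER_LIMIT_PERCENT)
```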
These, then, are the major parameters that may be used to build up a profile of two or more
voices for the purposes of forming an opinion as to their overall similarity.
[Figures 7 and 8 plots: axis label, first formant frequency (Hz).]
Figure 7: Comparison of vowels from a perpetrator with vowels from the Australian National Database of Spoken
Language. Each phonetic symbol represents the mean for that vowel in one of the two sets of data. All
means for a given data set are connected by a line:
= perpetrator = ANDOSL data
Figure 8: Comparison of vowels from a perpetrator with vowels from a suspect. Each phonetic symbol represents
the mean for that vowel in one of the two sets of data. All means for a given data set are connected by a
line: = perpetrator = suspect
samples of X degree of similarity, 85% are from the same speaker. This means that the
probability of observing X degree of similarity between samples from the same speaker would
be 85% and the probability of finding X degree of similarity between different speakers would
be 15%. The likelihood ratio is then 85 divided by 15 or 5.67.
A likelihood ratio greater than 1.0 supports the prosecution hypothesis – i.e. shows that the
degree of similarity found between the speech samples is more likely if they were by the same
speaker than if they were by different speakers. A likelihood ratio less than 1.0 supports the
defence hypothesis – i.e. shows that the degree of similarity found between the speech samples is
more likely if they were by different speakers than if they were by the same speaker. The value
of the likelihood ratio thus quantifies the strength of the evidence, and likelihood ratios from
different areas of forensic evidence can be combined. Each successive likelihood ratio should be
evaluated in terms of the degree of confidence in the assertion of guilt before consideration of
the evidence in question (the so-called ‘prior odds’) (Robertson & Vignaux 1995).
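
The arithmetic can be sketched as below, using the 85% and 15% figures from the example. The function names and the prior odds are hypothetical. The sketch assumes the usual odds form of Bayes' rule (posterior odds = prior odds × likelihood ratio), and multiplying several likelihood ratios together assumes that the items of evidence are independent of one another.

```python
def likelihood_ratio(p_evidence_if_same, p_evidence_if_different):
    """Ratio of the probability of the evidence under the two hypotheses."""
    if p_evidence_if_different <= 0:
        raise ValueError("probability under the different-speaker hypothesis must be positive")
    return p_evidence_if_same / p_evidence_if_different


def posterior_odds(prior_odds, *likelihood_ratios):
    """Update prior odds with each likelihood ratio in turn (independence assumed)."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis to a probability."""
    return odds / (1.0 + odds)


voice_lr = likelihood_ratio(0.85, 0.15)  # figures from the worked example in the text
print(round(voice_lr, 2))  # 5.67

# Hypothetical prior odds of 1 to 4, chosen only to show the update.
updated = posterior_odds(0.25, voice_lr)
print(f"posterior odds {updated:.2f}, probability {odds_to_probability(updated):.2f}")
```

The same likelihood ratio therefore leads to different posterior odds depending on the prior odds, which is why the ratio by itself cannot amount to an identification.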
References
BALDWIN J & FRENCH P (1991) Forensic Phonetics. London & New York: Pinter.
BRAUN A (1995) Fundamental frequency – how speaker-specific is it? In: A BRAUN & J-P
KÖSTER (eds) Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag Trier, 9-23.
BROEDERS APA & RIETVELD ACM (1995) Speaker identification by earwitness. In: A BRAUN &
J-P KÖSTER (eds) Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag Trier.
BUTCHER AR & MOODY MP (1999) The case of the ‘third voice’: a rare opportunity for closed
set comparison in the forensic context. Paper presented at the Annual Conference of the
International Association for Forensic Phonetics, York, England.
ESLING JH (1994) Voice quality. In R.E. Asher & J.M.Y. Simpson (eds) The Encyclopedia of
Language and Linguistics. Oxford: Pergamon Press, 4950-4953.
HOLLIEN H (1990) The Acoustics of Crime. New York & London: Plenum.
HOLLIEN H, HUNTLEY RA, KÜNZEL HJ & HOLLIEN PA (1995) Criteria for earwitness lineups.
Forensic Linguistics 2, 143-153.
ISSHIKI N & TAKEUCHI Y (1970) Factor analysis of hoarseness. Studia Phonologica 5, 37-44.
KÜNZEL HJ (1989) How well does average fundamental frequency correlate with speaker height
and weight? Phonetica 46, 117-125.
LAVER J (1980) The Phonetic Description of Voice Quality. Cambridge: Cambridge University
Press.
MILLAR JB, VONWILLER J, HARRINGTON JM & DERMODY P (1994) The Australian National
Database of Spoken Language. Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, Adelaide, 67-100.
OATES JM & RUSSELL A (1998) Learning voice analysis using an interactive multi-media
package: Development and preliminary evaluation. Journal of Voice 12, 500-512.