The sound sources are idealized as periodic, impulsive, or white noise and
can occur in the larynx or vocal tract.
Speech Organs
The lungs act as a power supply and provide airflow to the larynx.
The larynx modulates the airflow and provides either a periodic puff-like or
a noisy airflow source to the vocal tract.
The vocal tract consists of oral, nasal, and pharynx cavities, giving the
modulated airflow its color by spectrally shaping the source.
The variation of air pressure at the lips results in a traveling sound wave
that the listener perceives as speech.
The study of sound variations of phonemes that lead to the same meaning
is called phonetics.
During speaking, we take in short spurts of air and release them steadily.
We override our rhythmic breathing by making the duration of exhaling
roughly equal to the length of a sentence or phrase, during which the lung air
pressure is maintained at an approximately constant level.
The vocal folds are two masses of flesh, ligament, and muscle, which
stretch between the front and back of the larynx.
Fig. 3.3 Sketches of downward-looking view of human larynx (a) voicing; (b)
breathing.
The Glottis
The size of the glottis is controlled in part by the arytenoid cartilages and in
part by muscles within the folds.
The tension of the folds is controlled by muscle within the folds, as well as
the cartilage around the folds.
The vocal folds, as well as the epiglottis, close during eating, and open
during breathing.
The time interval during which the vocal folds are closed, and no flow
occurs, is referred to as the glottal closed phase.
The time interval over which there is nonzero flow and up to the maximum
of the airflow velocity is referred to as the glottal open phase, and the time
interval from the airflow maximum to the time of glottal closure is referred
to as the return phase.
The time duration of one glottal cycle is referred to as the pitch period and
the reciprocal of the pitch period is the corresponding pitch, also referred
to as the fundamental frequency.
The pitch range is about 60 Hz to 400 Hz. Typically, males have lower pitch
than females because their vocal folds are longer and more massive.
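The reciprocal relation between pitch period and pitch can be made concrete with a small sketch. The 8 kHz sampling rate and the function names are assumptions chosen for illustration, not values from the text.

```python
def pitch_hz(period_s):
    """Fundamental frequency (Hz) as the reciprocal of the pitch period (s)."""
    return 1.0 / period_s


def period_in_samples(f0_hz, fs_hz):
    """Length of one glottal cycle in samples at sampling rate fs_hz."""
    return fs_hz / f0_hz


FS_HZ = 8000.0  # assumed sampling rate, not taken from the text

# Sweep the pitch range quoted above (about 60 Hz to 400 Hz).
for f0 in (60.0, 100.0, 200.0, 400.0):
    period_ms = 1000.0 / f0
    print(f"{f0:5.0f} Hz -> {period_ms:6.2f} ms, "
          f"{period_in_samples(f0, FS_HZ):6.1f} samples")
```

A lower pitch therefore means a longer glottal cycle, which is what panel (b) of the glottal flow figure illustrates.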
Fig. 3.7 Illustration of periodic glottal flow: (a) typical glottal flow; (b) same as
(a) with lower pitch; (c) same as (a) with softer glottal flow.
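$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] \tag{2-2}$$
$$\phantom{U(\omega, \tau)} = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau), \tag{2-3}$$
where $\omega_k = \frac{2\pi}{P} k$ and $\frac{2\pi}{P}$ is the fundamental frequency.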
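The harmonic-sum form of the windowed source spectrum, (1/P) Σ_k G(ω_k) W(ω − ω_k, τ) with ω_k = 2πk/P, can be checked numerically. The following is a minimal sketch under stated assumptions: the pulse shape, pitch period, and window length are hypothetical, and the sum is restricted to k = 0, ..., P−1 because discrete-time spectra are 2π-periodic.

```python
import numpy as np

P = 80                # pitch period in samples (assumed)
NW = 400              # window length: five pitch periods (assumed)
g = np.hanning(40)    # hypothetical glottal pulse, shorter than one period

# One period of the source u[n] = g[n] * p[n]: the pulse padded to length P.
g_period = np.zeros(P)
g_period[:len(g)] = g
u = np.tile(g_period, NW // P)   # periodic source on the window support
w = np.hamming(NW)               # analysis window w[n, tau] for one fixed tau


def dtft(x, omega):
    """Direct evaluation of sum_n x[n] exp(-j omega n), n = 0..len(x)-1."""
    n = np.arange(len(x))
    return np.exp(-1j * np.outer(omega, n)) @ x


omega = np.linspace(0.0, np.pi, 256)

# Left-hand side: spectrum of the windowed source.
lhs = dtft(w * u, omega)

# Right-hand side: window spectrum replicated at each harmonic, weighted by G.
omega_k = 2 * np.pi * np.arange(P) / P
G_k = np.fft.fft(g_period)       # G(omega_k) for k = 0..P-1
rhs = sum(G_k[k] * dtft(w, omega - omega_k[k]) for k in range(P)) / P

print("max |lhs - rhs| =", np.max(np.abs(lhs - rhs)))
```

The two sides agree to numerical precision, which shows the window spectrum being copied to every harmonic and scaled by the glottal spectrum at that harmonic.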
In more relaxed voicing (Fig. 3.7), the vocal folds do not close as abruptly,
and the glottal waveform has more rounded corners with an average 15 dB/octave rolloff.
The extent and form of jitter and shimmer can contribute to voice
character.
A high degree of jitter results in a voice with hoarse quality, which can be
characteristic of a particular speaker or can be created under specific
speaking conditions such as with stress and fear.
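Jitter and shimmer can be quantified in several ways. The sketch below assumes the usual reading of the terms (jitter as cycle-to-cycle variation of the pitch period, shimmer as cycle-to-cycle variation of the glottal pulse amplitude) and uses one simple measure, the mean absolute difference between consecutive cycles relative to the mean. The measure, the function name, and the sample values are illustrative assumptions, not definitions taken from the text.

```python
def mean_relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical measurements over five consecutive glottal cycles.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]
peak_amplitudes = [1.00, 0.95, 1.02, 0.97, 1.01]

jitter = mean_relative_perturbation(periods_ms)        # timing perturbation
shimmer = mean_relative_perturbation(peak_amplitudes)  # amplitude perturbation

print(f"jitter  = {100 * jitter:.2f} %")
print(f"shimmer = {100 * shimmer:.2f} %")
```

A perfectly periodic source gives zero for both measures; larger values correspond to a more irregular glottal source.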
2.1.2 Unvoicing
In the unvoiced state, the folds are closer together and more tense than in
the breathing state, thus allowing for turbulence to be generated at the
folds.
There are other forms of vocal fold movement that do not fall clearly into
any of the three states of breathing, voicing and unvoicing.
In vocal fry (Fig. 3.9.a), the folds are massy and relaxed with an abnormally
low and irregular pitch, which is characterized by secondary glottal pulses.
In diplophonia (Fig. 3.9.b), secondary glottal pulses occur between primary
pulses but within the closed phase.
The vocal tract is comprised of the oral cavity from the larynx to the lips
and the nasal passage that is coupled to the oral tract by way of the
velum.
The vocal tract spectrally colors the source, which is important for
making perceptually distinct speech sounds.
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \tag{2-4}$$
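The spectral coloring by an all-pole vocal tract can be sketched by evaluating A / Π_k (1 − c_k z⁻¹)(1 − c_k* z⁻¹) on the unit circle. The sampling rate, resonance frequencies, and pole radius below are hypothetical values chosen for illustration.

```python
import numpy as np


def allpole_response(poles, omega, gain=1.0):
    """Evaluate A / prod_k (1 - c_k z^-1)(1 - conj(c_k) z^-1) at z = exp(j omega)."""
    z_inv = np.exp(-1j * omega)
    den = np.ones_like(z_inv)
    for c in poles:
        den = den * (1 - c * z_inv) * (1 - np.conj(c) * z_inv)
    return gain / den


FS_HZ = 8000.0                          # assumed sampling rate
RESONANCES_HZ = (500.0, 1500.0, 2500.0) # hypothetical resonance frequencies
RADIUS = 0.97                           # hypothetical pole radius, inside the unit circle

# One pole c_k = r exp(j omega_k) per resonance; its conjugate is added in the loop.
poles = [RADIUS * np.exp(1j * 2 * np.pi * f / FS_HZ) for f in RESONANCES_HZ]

freqs_hz = np.linspace(0.0, FS_HZ / 2, 4001)
mag = np.abs(allpole_response(poles, 2 * np.pi * freqs_hz / FS_HZ))

# Interior local maxima of the magnitude response.
is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
peaks_hz = freqs_hz[1:-1][is_peak]
print("spectral peaks (Hz):", peaks_hz)
```

Each conjugate pole pair produces one spectral peak close to the pole angle, and moving the radius toward 1 narrows that peak.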
2.2.2 Fourier Transform of speech signals after going through the vocal tract.
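Assuming a periodic glottal flow source of the form
$$u[n] = g[n] * p[n], \qquad p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP], \tag{2-2}$$
the vocal tract output after passing u[n] through an LTI vocal tract with
impulse response h[n] is
$$x[n] = h[n] * (g[n] * p[n]). \tag{2-3}$$
Windowing the output gives
$$x[n, \tau] = w[n, \tau] \left\{ h[n] * (g[n] * p[n]) \right\}, \tag{2-4}$$
whose discrete-time Fourier transform (DTFT) is
$$x[n, \tau] \;\overset{\text{DTFT}}{\longleftrightarrow}\; X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]. \tag{2-5}$$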
A formant corresponds to the vocal tract poles, while the harmonics arise
from the periodicity of the glottal source.
Fig. 3.13 Examples of voiced, fricative, and plosive sounds in the sentence
"Which tea party did Baker go to?": (a) speech waveform; (b)-(d) magnified
voiced, fricative, and plosive sounds from (a).
The Fourier transform of the windowed speech waveform, i.e. the short-time Fourier transform (STFT), is given by
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \exp(-j\omega n). \tag{3-1}$$
The spectrogram is a graphical display of the magnitude of the time-varying spectral characteristics and is given by
$$S(\omega, \tau) = |X(\omega, \tau)|^2. \tag{3-2}$$
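The STFT and spectrogram definitions translate into a few lines of NumPy. This is a minimal sketch: the function names, the Hamming window, the hop size, and the test tone are assumptions. Each frame is indexed from its own start, so the result differs from the definition by a linear phase factor that does not affect the squared magnitude.

```python
import numpy as np


def stft(x, window, hop, n_fft):
    """Short-time Fourier transform with frames starting at tau = 0, hop, 2*hop, ...

    Returns an array of shape (num_frames, n_fft // 2 + 1).
    """
    n_w = len(window)
    starts = range(0, len(x) - n_w + 1, hop)
    frames = np.array([x[s:s + n_w] * window for s in starts])  # x[n, tau]
    return np.fft.rfft(frames, n=n_fft, axis=1)


def spectrogram(x, window, hop, n_fft):
    """S(omega, tau) = |X(omega, tau)|^2."""
    return np.abs(stft(x, window, hop, n_fft)) ** 2


FS_HZ = 8000                                  # assumed sampling rate
t = np.arange(FS_HZ) / FS_HZ                  # one second of samples
tone = np.sin(2 * np.pi * 1000.0 * t)         # hypothetical 1 kHz test tone

S = spectrogram(tone, np.hamming(256), hop=128, n_fft=512)
peak_bins = np.argmax(S, axis=1)
print("frames x bins:", S.shape)
print("peak frequency of first frame (Hz):", peak_bins[0] * FS_HZ / 512)
```

Plotting `10 * np.log10(S.T)` against frame time and bin frequency gives the familiar spectrogram image.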
Fig. 3.14 Formation of (a) the narrowband and (b) the wideband spectrograms.
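The difference between the two spectrogram types comes down to window length: by the usual convention a narrowband spectrogram uses a window spanning several pitch periods, while a wideband spectrogram uses a window shorter than one pitch period. The sketch below compares one frame of each for an idealized impulse-train source. The sampling rate, pitch period, and both window lengths are assumptions.

```python
import numpy as np

FS_HZ = 8000    # assumed sampling rate
P = 80          # pitch period in samples (100 Hz at FS_HZ)
N_FFT = 1600    # 5 Hz per FFT bin, so harmonics fall exactly on bins

n = np.arange(400)
source = ((n - 16) % P == 0).astype(float)   # impulses at n = 16, 96, 176, ...


def harmonic_contrast(window_length):
    """Ratio of spectral magnitude on a harmonic to that midway between harmonics."""
    frame = source[:window_length] * np.hamming(window_length)
    mag = np.abs(np.fft.rfft(frame, n=N_FFT))
    on_harmonic = mag[200]   # 1000 Hz, the 10th harmonic
    between = mag[210]       # 1050 Hz, halfway to the 11th harmonic
    return on_harmonic / between


narrowband = harmonic_contrast(320)   # 40 ms window: four pitch periods
wideband = harmonic_contrast(32)      # 4 ms window: less than one pitch period

print("narrowband contrast:", narrowband)
print("wideband contrast:  ", wideband)
```

The long window resolves individual harmonics, so the contrast is large. The short window sees a single pulse and gives a flat spectrum within the frame, so pitch shows up instead as variation from frame to frame.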
Fig. 3.15 Comparison of measured spectrograms for the utterance "Which tea
party did Baker go to?": (a) speech waveform; (b) wideband spectrogram; (c)
narrowband spectrogram.
E.g. the
Vowels
Nasals
The source is quasi-periodic airflow puffs from the vibrating vocal folds.
The velum is lowered and the air flows mainly through the nasal cavity, the
oral tract being constricted; thus sound is radiated at the nostrils. E.g. /m/
in "mo" (oral tract constricted at the lips) and /n/ in "no" (constriction is
with the tongue to the gum ridge).
The spectrum of a nasal is dominated by the low resonance of the large
volume of the nasal cavity, which also has a large bandwidth because of
the viscous losses of airflow over the complexly configured surface.
The closed oral cavity has its own resonances and absorbs acoustic
energy. These anti-resonances can be modeled as zeros of the vocal tract
transfer function.
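The effect of an anti-resonance can be sketched by adding a conjugate zero pair to a transfer function with a single low-frequency resonance. All numerical values below (sampling rate, pole and zero frequencies, radii) are hypothetical and chosen only to make the spectral dip visible.

```python
import numpy as np


def conjugate_pair(c, omega):
    """(1 - c z^-1)(1 - conj(c) z^-1) evaluated at z = exp(j omega)."""
    z_inv = np.exp(-1j * omega)
    return (1 - c * z_inv) * (1 - np.conj(c) * z_inv)


FS_HZ = 8000.0   # assumed sampling rate

# Hypothetical low nasal resonance (pole pair) and oral-cavity anti-resonance (zero pair).
pole = 0.95 * np.exp(1j * 2 * np.pi * 250.0 / FS_HZ)
zero = 0.98 * np.exp(1j * 2 * np.pi * 1000.0 / FS_HZ)

freqs_hz = np.linspace(0.0, FS_HZ / 2, 4001)
omega = 2 * np.pi * freqs_hz / FS_HZ
H = conjugate_pair(zero, omega) / conjugate_pair(pole, omega)
mag_db = 20 * np.log10(np.abs(H))

peak_hz = freqs_hz[np.argmax(mag_db)]   # resonance
dip_hz = freqs_hz[np.argmin(mag_db)]    # anti-resonance
print("peak near (Hz):", peak_hz)
print("dip near (Hz): ", dip_hz)
```

The zero pair carves a notch near its angle, the spectral signature of energy absorbed by the closed oral cavity.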
In nasalization of vowels, the velum is partially open. The speech sound is
primarily due to the sound at the lips and not the sound at the nose output.
Vowels adjacent to nasal consonants tend to be nasalized.
Fricatives
(4-1)
where
Plosives
Fig. 3.24 Vocal tract configurations for unvoiced and voiced plosive pairs.
Plosives can be both voiced and unvoiced.
The voice onset time is the difference between the time of the burst and
the onset of voicing in the following vowel. The length of the voice onset
time and the place of constriction vary with the plosive consonant.
In voiced plosives, although the oral tract is closed, we hear a low-frequency vibration, called the voice bar, due to the propagation of the
vibration at the vocal folds through the walls of the throat. Unlike unvoiced
plosives, there is little aspiration.
A simple model for a voiced plosive is
(4-1)
m =
m =
Due to the changing vocal tract shape during the transition from the burst
to a following steady vowel, h and hf are assumed to be linear, but time-varying.
The burst is modeled as an impulse which is assumed to occur at time n=0.
Fig. 4.14 Concatenated tube model. The k-th tube has cross-sectional area Ak
and length lk.
Due to time limitation, we shall not go into these acoustic models.
If the airflows are modeled as signal flows, then the above tube model can
be approximated by the following signal flow graph:
Fig. 4.16 Signal flow graphs of (a) two concatenated tubes; (b) lip boundary
condition; (c) glottal boundary condition.
Fig. 4.18 Signal flow graph conversion to discrete time of (a) lossless two-tube
model; (b) discrete-time version of (a); (c) conversion of (b) with single-sample delays.
G(z) is the z-transform of the glottal pulse. V(z) is the z-transform of the
vocal tract transfer function, and R(z) is the radiation loss at the lips (R(z) in
dotted line models the radiation loss at the glottis). An approximation of a
typical glottal flow waveform over one cycle is of the form
(4-1)
The z-transform is
$$G(z) = \frac{1}{(1 - \beta z)^2}, \qquad \beta < 1. \tag{4-1}$$
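Reading the glottal model as G(z) = 1/(1 − βz)² with 0 < β < 1 (two identical poles at z = 1/β, outside the unit circle), the corresponding time-domain pulse is left-sided: g[n] = (β⁻ⁿ u[−n]) * (β⁻ⁿ u[−n]) = (1 − n) β⁻ⁿ for n ≤ 0, where u[n] here denotes the unit step. The sketch below checks this numerically. The symbol β, its value, and the truncation length are assumptions made for illustration.

```python
import numpy as np

BETA = 0.9   # hypothetical value with 0 < BETA < 1
M = 400      # number of samples kept; the tail beyond this is negligible

m = np.arange(M)                 # m = -n, so m = 0 is the instant of glottal closure
one_sided = BETA ** m            # beta^{-n} u[-n] sampled at n = -m

# g[-m] for m = 0..M-1: convolution of the left-sided exponential with itself.
g_reversed = np.convolve(one_sided, one_sided)[:M]
closed_form = (m + 1) * BETA ** m            # (1 - n) beta^{-n} at n = -m

# z-transform on the unit circle: G(z) = sum_n g[n] z^{-n} = sum_m g[-m] z^{m}.
omega = np.linspace(0.1, np.pi, 64)
z = np.exp(1j * omega)
G_series = (z[:, None] ** m[None, :]) @ g_reversed
G_closed = 1.0 / (1.0 - BETA * z) ** 2

print("max pulse error:    ", np.max(np.abs(g_reversed - closed_form)))
print("max transform error:", np.max(np.abs(G_series - G_closed)))
```

Read forward in time, this pulse rises gradually and then drops abruptly at n = 0, mimicking slow glottal opening followed by rapid closure.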