
Lecture # 2

Session 2003

Acoustic Theory of Speech Production


Overview
Sound sources
Vocal tract transfer function
Wave equations
Sound propagation in a uniform acoustic tube
Representing the vocal tract with simple acoustic tubes
Estimating natural frequencies from area functions
Representing the vocal tract with multiple uniform tubes

6.345 Automatic Speech Recognition


Anatomical Structures for Speech Production


Phonemes in American English


PHONEME  EXAMPLE    PHONEME  EXAMPLE    PHONEME  EXAMPLE
/i/      beat       /s/      see        /w/      wet
/ɪ/      bit        /š/      she        /r/      red
/e/      bait       /f/      fee        /l/      let
/ɛ/      bet        /θ/      thief      /y/      yet
/æ/      bat        /z/      z          /m/      meet
/ɑ/      Bob        /ž/      Gigi       /n/      neat
/ɔ/      bought     /v/      v          /ŋ/      sing
/ʌ/      but        /ð/      thee       /č/      church
/o/      boat       /p/      pea        /ǰ/      judge
/ʊ/      book       /t/      tea        /h/      heat
/u/      boot       /k/      key
/ɝ/      Burt       /b/      bee
/ɑʸ/     bite       /d/      Dee
/ɔʸ/     Boyd       /g/      geese
/ɑʷ/     bout
/ə/      about

Places of Articulation for Speech Sounds

[Figure: vocal tract diagram marking the places of articulation: labial, dental, alveolar, palato-alveolar, palatal, velar, uvular]

Speech Waveform: An Example

[Figure: speech waveform of the utterance "Two plus seven is less than ten"]

A Wideband Spectrogram

[Figure: wideband spectrogram of the utterance "Two plus seven is less than ten"]

Acoustic Theory of Speech Production


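The acoustic characteristics of speech are usually modelled as a
sequence of source, vocal tract filter, and radiation characteristics

[Figure: vocal tract with glottal volume velocity $U_G$, lip volume velocity $U_L$, and radiated pressure $P_r$ at distance $r$ from the lips]

$$P_r(j\Omega) = S(j\Omega)\,T(j\Omega)\,R(j\Omega)$$

For vowel production:

$$S(j\Omega) = U_G(j\Omega), \qquad T(j\Omega) = \frac{U_L(j\Omega)}{U_G(j\Omega)}, \qquad R(j\Omega) = \frac{P_r(j\Omega)}{U_L(j\Omega)}$$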

Sound Source: Vocal Fold Vibration


Modelled as a volume velocity source at the glottis, $U_G(j\Omega)$

[Figure: glottal volume velocity waveform $U_G(t)$ with period $T_0 = 1/F_0$, the radiated pressure waveform $P_r(t)$, and the source spectrum $U_G(f)$ with a $1/f^2$ falloff]

            F0 ave (Hz)   F0 min (Hz)   F0 max (Hz)
Men         125           80            200
Women       225           150           350
Children    300           200           500

Sound Source: Turbulence Noise


Turbulence noise is produced at a constriction in the vocal tract
Aspiration noise is produced at the glottis
Frication noise is produced above the glottis
Modelled as a series pressure source at the constriction, $P_S(j\Omega)$

[Figure: noise source spectrum $P_s(f)$, with the frequency $0.2\,V/D$ marked on the frequency axis]

$V$: velocity at constriction
$D$: critical dimension $= \sqrt{4A/\pi} \approx \sqrt{A}$

Vocal Tract Wave Equations


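Define:

$u(x, t)$ = particle velocity
$U(x, t)$ = volume velocity ($U = uA$)
$p(x, t)$ = sound pressure variation ($P = P_0 + p$)
$\rho$ = density of air
$c$ = velocity of sound

Assuming plane wave propagation (for a cross dimension $\ll \lambda$),
and a one-dimensional wave motion, it can be shown that

$$-\frac{\partial p}{\partial x} = \rho\,\frac{\partial u}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{1}{\rho c^2}\,\frac{\partial p}{\partial t}, \qquad \frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 u}{\partial t^2}$$

Time and frequency domain solutions are of the form

$$u(x, t) = u^+\!\left(t - \frac{x}{c}\right) - u^-\!\left(t + \frac{x}{c}\right), \qquad u(x, s) = \frac{1}{\rho c}\left[P_+ e^{-sx/c} - P_- e^{sx/c}\right]$$

$$p(x, t) = \rho c\left[u^+\!\left(t - \frac{x}{c}\right) + u^-\!\left(t + \frac{x}{c}\right)\right], \qquad p(x, s) = P_+ e^{-sx/c} + P_- e^{sx/c}$$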

Propagation of Sound in a Uniform Tube


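[Figure: uniform tube of cross-sectional area $A$ and length $\ell$, driven by the glottal source $U_G$ at one end and open at the other]

The vocal tract transfer function of volume velocities is

$$T(j\Omega) = \frac{U_L(j\Omega)}{U_G(j\Omega)} = \frac{U(\ell, j\Omega)}{U(0, j\Omega)}$$

Using the boundary conditions $U(0, s) = U_G(s)$ and $P(\ell, s) = 0$

$$T(s) = \frac{2}{e^{s\ell/c} + e^{-s\ell/c}}, \qquad T(j\Omega) = \frac{1}{\cos(\Omega\ell/c)}$$

The poles of the transfer function $T(j\Omega)$ are where $\cos(\Omega\ell/c) = 0$

$$\frac{(2\pi f_n)\,\ell}{c} = \frac{(2n - 1)\,\pi}{2}, \qquad f_n = \frac{c}{4\ell}(2n - 1), \qquad \lambda_n = \frac{4\ell}{2n - 1}, \qquad n = 1, 2, \ldots$$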

Propagation of Sound in a Uniform Tube (cont)

For $c = 34{,}000$ cm/sec, $\ell = 17$ cm, the natural frequencies (also
called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, ...

[Figure: $20\log_{10}|T(j\Omega)|$ in dB versus frequency (kHz), with the poles marked by ×]

The transfer function of a tube with no side branches, excited at
one end and response measured at another, only has poles
The formant frequencies will have finite bandwidth when vocal
tract losses are considered (e.g., radiation, walls, viscosity, heat)
The length of the vocal tract, $\ell$, corresponds to $\frac{1}{4}\lambda_1, \frac{3}{4}\lambda_2, \frac{5}{4}\lambda_3, \ldots$,
where $\lambda_i$ is the wavelength of the $i$th natural frequency
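
The quarter-wavelength relation can be checked numerically. The following is a minimal sketch, assuming the lossless uniform-tube formula $f_n = (2n-1)\,c/(4\ell)$ from the previous slide; the function name `uniform_tube_formants` is illustrative and not part of the lecture.

```python
def uniform_tube_formants(length_cm, c_cm_per_s=34000.0, count=3):
    """Resonances f_n = (2n - 1) * c / (4 * l) of a lossless uniform tube, in Hz."""
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, count + 1)]


formants = uniform_tube_formants(17.0)
print(formants)  # [500.0, 1500.0, 2500.0]

# Each tube length is an odd multiple of a quarter wavelength: l = (2n - 1) * lambda_n / 4
wavelengths = [34000.0 / f for f in formants]
print(wavelengths)
```

Shortening the tube raises every formant by the same factor, which is one reason shorter vocal tracts tend to show higher formants.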

Standing Wave Patterns in a Uniform Tube

A uniform tube closed at one end and open at the other is often
referred to as a quarter wavelength resonator

[Figure: standing wave patterns (SWP) of $|U(x)|$ between glottis and lips for F1, F2, and F3; the positions $\frac{2}{3}\ell$ (F2) and $\frac{2}{5}\ell$, $\frac{4}{5}\ell$ (F3) are marked along the tube]

Natural Frequencies of Simple Acoustic Tubes

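Quarter wavelength resonator (tube from $x = -\ell$ to $x = 0$, closed at $x = 0$; $Y_{-\ell}$ is the admittance at $x = -\ell$)

$$P(x, j\Omega) = 2P_+ \cos\frac{\Omega x}{c}, \qquad U(x, j\Omega) = -j\,\frac{A}{\rho c}\,2P_+ \sin\frac{\Omega x}{c}$$

$$Y_{-\ell} = j\,\frac{A}{\rho c}\tan\frac{\Omega\ell}{c} \approx j\Omega\,\frac{A\ell}{\rho c^2} = j\Omega C_A \qquad (\Omega\ell/c \ll 1)$$

$$f_n = \frac{c}{4\ell}(2n - 1), \qquad n = 1, 2, \ldots$$

Half-wavelength resonator (tube from $x = -\ell$ to $x = 0$, open at $x = 0$; $Y_{-\ell}$ is the admittance at $x = -\ell$)

$$P(x, j\Omega) = -j\,2P_+ \sin\frac{\Omega x}{c}, \qquad U(x, j\Omega) = \frac{A}{\rho c}\,2P_+ \cos\frac{\Omega x}{c}$$

$$Y_{-\ell} = -j\,\frac{A}{\rho c}\cot\frac{\Omega\ell}{c} \approx -j\,\frac{A}{\Omega\rho\ell} = \frac{1}{j\Omega M_A} \qquad (\Omega\ell/c \ll 1)$$

$$f_n = \frac{c}{2\ell}\,n, \qquad n = 0, 1, 2, \ldots$$

$C_A = A\ell/\rho c^2$ = acoustic compliance, $M_A = \rho\ell/A$ = acoustic mass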

Approximating Vocal Tract Shapes

[Figure: vocal tract shapes for [i], [a], and [u], each approximated by two uniform tubes with areas $A_1$, $A_2$ and lengths $\ell_1$, $\ell_2$]

Estimating Natural Resonance Frequencies


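Resonance frequencies occur where impedance (or admittance)
function equals natural (e.g., open circuit) boundary conditions

[Figure: two-tube model driven by $U_G$, with back tube (area $A_1$, length $\ell_1$), front tube (area $A_2$, length $\ell_2$), output $U_L$, and the junction condition $Y_1 + Y_2 = 0$]

For a two tube approximation it is easiest to solve for $Y_1 + Y_2 = 0$

$$j\,\frac{A_1}{\rho c}\tan\frac{\Omega\ell_1}{c} - j\,\frac{A_2}{\rho c}\cot\frac{\Omega\ell_2}{c} = 0$$

$$\sin\frac{\Omega\ell_1}{c}\,\sin\frac{\Omega\ell_2}{c} - \frac{A_2}{A_1}\cos\frac{\Omega\ell_1}{c}\,\cos\frac{\Omega\ell_2}{c} = 0$$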
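
The two-tube condition has no closed-form solution in general, so it is usually solved numerically. The sketch below is one way to do that, assuming the sine-cosine form of $Y_1 + Y_2 = 0$; it scans for sign changes and refines each bracket by bisection. The function name, the 1 Hz scan step, the speed of sound of 35,000 cm/sec, and the example dimensions (back tube 1 cm², 9 cm; front tube 7 cm², 8 cm) are assumptions chosen for illustration.

```python
import math


def two_tube_resonances(a1, l1, a2, l2, c=35000.0, f_max=3000.0, step=1.0):
    """Roots (Hz) of sin(W l1/c) sin(W l2/c) - (A2/A1) cos(W l1/c) cos(W l2/c) = 0."""

    def g(f):
        w = 2.0 * math.pi * f
        x1, x2 = w * l1 / c, w * l2 / c
        return (math.sin(x1) * math.sin(x2)
                - (a2 / a1) * math.cos(x1) * math.cos(x2))

    roots = []
    f_lo = 0.0
    while f_lo < f_max:
        f_hi = f_lo + step
        if g(f_lo) * g(f_hi) < 0.0:  # a sign change brackets one root
            lo, hi = f_lo, f_hi
            for _ in range(50):  # bisection refinement
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
        f_lo = f_hi
    return roots


resonances = two_tube_resonances(a1=1.0, l1=9.0, a2=7.0, l2=8.0)
print([round(f, 1) for f in resonances])
```

With these assumed dimensions the three roots below 3 kHz fall near 789 Hz, 1276 Hz, and 2808 Hz. Working with the sine-cosine form rather than the tangent-cotangent form avoids the singularities of $\tan$ and $\cot$ during the scan.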

Decoupling Simple Tube Approximations


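If $A_1 \gg A_2$, or $A_1 \ll A_2$, the tubes can be decoupled and natural
frequencies of each tube can be computed independently
For the vowel /i/, the formant frequencies are obtained from:

[Figure: two-tube model with a wide back tube (area $A_1$, length $\ell_1$) and a narrow front tube (area $A_2$, length $\ell_2$)]

$$f_n = \frac{c}{2\ell_1}\,n \qquad \text{plus} \qquad f_n = \frac{c}{2\ell_2}\,n$$

At low frequencies:

$$f = \frac{c}{2\pi}\left(\frac{A_2}{A_1\ell_1\ell_2}\right)^{1/2} = \frac{1}{2\pi}\left(\frac{1}{C_{A_1} M_{A_2}}\right)^{1/2}$$

This low resonance frequency is called the Helmholtz resonance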
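
As a rough numerical illustration of the decoupled estimates, the sketch below evaluates the Helmholtz resonance and the lowest half-wavelength resonance of each tube. The dimensions (back tube 8 cm², 9 cm; front tube 1 cm², 6 cm), the speed of sound of 35,000 cm/sec, and the function names are assumptions for this example.

```python
import math


def helmholtz_frequency(a1, l1, a2, l2, c=35000.0):
    """f = (c / 2 pi) * sqrt(A2 / (A1 * l1 * l2)), in Hz.

    The back cavity (A1, l1) acts as an acoustic compliance and the
    narrow front tube (A2, l2) acts as an acoustic mass.
    """
    return c / (2.0 * math.pi) * math.sqrt(a2 / (a1 * l1 * l2))


def half_wave_frequency(length, n=1, c=35000.0):
    """n-th half-wavelength resonance f_n = n * c / (2 * l), in Hz."""
    return n * c / (2.0 * length)


f_helmholtz = helmholtz_frequency(a1=8.0, l1=9.0, a2=1.0, l2=6.0)
f_back = half_wave_frequency(9.0)
f_front = half_wave_frequency(6.0)
print(round(f_helmholtz), round(f_back), round(f_front))  # 268 1944 2917
```

The low Helmholtz resonance plus widely separated half-wavelength resonances gives the low F1 and high F2 pattern typical of /i/.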



Vowel Production Example


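First configuration: back tube $A_1 = 1$ cm², $\ell_1 = 9$ cm; front tube $A_2 = 7$ cm², $\ell_2 = 8$ cm
Decoupled resonances: 972, 2917, ... Hz (back tube); 1093, ... Hz (front tube)

Formant   Actual (Hz)   Estimated (Hz)
F1        789           972
F2        1276          1093
F3        2808          2917

Second configuration: back tube $A_1 = 8$ cm², $\ell_1 = 9$ cm; front tube $A_2 = 1$ cm², $\ell_2 = 6$ cm
Decoupled resonances: 268 Hz (Helmholtz); 1944, ... Hz (back tube); 2917, ... Hz (front tube)

Formant   Actual (Hz)   Estimated (Hz)
F1        256           268
F2        1905          1944
F3        2917          2917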

Example of Vowel Spectrograms


[Figure: analysis displays for the words /bit/ and /bat/, each with panels for zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), wide band spectrogram (0 to 8 kHz), and waveform, plotted against time (0.0 to 0.7 seconds)]

Estimating Anti-Resonance Frequencies (Zeros)

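Zeros occur at frequencies where there is no measurable output

[Figure: nasal consonant model with source $U_G$, pharyngeal tube (area $A_p$, length $\ell_p$, admittance $Y_p$), nasal tube (area $A_n$, length $\ell_n$, admittance $Y_n$, output $U_N$), and oral tube (area $A_o$, length $\ell_o$, admittance $Y_o$)]

[Figure: fricative or stop consonant model with back cavity ($A_b$, $\ell_b$), constriction ($A_c$, $\ell_c$), front cavity ($A_f$, $\ell_f$), pressure source $P_s$, and output $U_L$; the conditions $Y_1 = 0$ and $Y_3 + Y_4 = 0$ are marked on the model]

For nasal consonants, zeros in $U_N$ occur where $Y_o = \infty$
For fricatives or stop consonants, zeros in $U_L$ occur where the
impedance behind source is infinite (i.e., a hard wall at source)
Zeros occur when measurements are made in vocal tract interior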

Consonant Production
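[Figure: three-section model with back cavity (area $A_b$, length $\ell_b$), constriction (area $A_c$, length $\ell_c$), and front cavity (area $A_f$, length $\ell_f$), excited by the pressure source $P_s$; separate diagrams mark the sections that contribute the poles and the zeros]

(areas in cm², lengths in cm, frequencies in Hz)

        A_b    A_c    A_f    ℓ_b    ℓ_c    ℓ_f
[g]     5      0.2    4      9      3      5
[s]     5      0.5    4      11     3      2.5

        [g]                  [s]
poles     zeros      poles     zeros
215       0          306       0
1750      1944       1590      1590
1944      2916       3180      2916
3888      3888       3500      3180
...       ...        ...       ...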
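
The tabulated poles and zeros are consistent with a decoupled reading of the three-section model. The sketch below is one such reading, offered as an assumption rather than as the lecture's stated procedure: poles from the Helmholtz resonance of the back cavity and constriction, the half-wavelength resonances of the back cavity, and the quarter-wavelength resonances of the front cavity; zeros at 0 Hz, the back-cavity half-wavelength resonances, and the quarter-wavelength resonances of the constriction. The speed of sound of 35,000 cm/sec, the 4 kHz limit, and all function names are illustrative. The front cavity area does not enter this estimate.

```python
import math

C = 35000.0  # assumed speed of sound in cm/sec


def _series(first, spacing, f_max):
    """Frequencies first, first + spacing, ... up to f_max."""
    out, f = [], first
    while f <= f_max:
        out.append(f)
        f += spacing
    return out


def estimate_poles_zeros(a_b, a_c, l_b, l_c, l_f, f_max=4000.0):
    """Decoupled pole and zero estimates (Hz) for a back cavity,
    constriction, and front cavity excited at the constriction."""
    helmholtz = C / (2.0 * math.pi) * math.sqrt(a_c / (a_b * l_b * l_c))
    back_half = _series(C / (2 * l_b), C / (2 * l_b), f_max)
    front_quarter = _series(C / (4 * l_f), C / (2 * l_f), f_max)
    constriction_quarter = _series(C / (4 * l_c), C / (2 * l_c), f_max)
    poles = sorted([helmholtz] + back_half + front_quarter)
    zeros = sorted([0.0] + back_half + constriction_quarter)
    return poles, zeros


poles_g, zeros_g = estimate_poles_zeros(a_b=5, a_c=0.2, l_b=9, l_c=3, l_f=5)
poles_s, zeros_s = estimate_poles_zeros(a_b=5, a_c=0.5, l_b=11, l_c=3, l_f=2.5)
print([round(f) for f in poles_g], [round(f) for f in zeros_g])
print([round(f) for f in poles_s], [round(f) for f in zeros_s])
```

Under these assumptions the estimates agree with the table to within a few hertz. Where a pole and a zero coincide, as with the back-cavity resonances, they largely cancel in the output spectrum.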


Example of Consonant Spectrograms


[Figure: analysis displays for the words /kip/ and /si/, each with panels for zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), wide band spectrogram (0 to 8 kHz), and waveform, plotted against time (0.0 to 0.7 seconds for /kip/, 0.0 to 0.8 seconds for /si/)]

Perturbation Theory
Consider a uniform tube, closed at one end and open at the other

[Figure: uniform tube with a short section of length $\ell$ and area $A$ at the open end, with admittance $Y_\ell$]

$$Y_\ell \approx -j\,\frac{A}{\Omega\rho\ell} \qquad \text{for small } \ell$$

Reducing the area of a small piece of the tube near the opening
(where $U$ is max) has the same effect as keeping the area fixed
and lengthening the tube
Since lengthening the tube lowers the resonant frequencies,
narrowing the tube near points where $U(x)$ is maximum in the
standing wave pattern for a given formant decreases the value of
that formant

Perturbation Theory (contd)


[Figure: uniform tube with a short section of length $\ell$ and area $A$ at the closed end, with admittance $Y_\ell$]

$$Y_\ell \approx j\Omega\,\frac{A\ell}{\rho c^2} \qquad \text{for small } \ell$$

Reducing the area of a small piece of the tube near the closure
(where $p$ is max) has the same effect as keeping the area fixed and
shortening the tube
Since shortening the tube will increase the values of the formants,
narrowing the tube near points where $p(x)$ is maximum in the
standing wave pattern for a given formant will increase the value
of that formant
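
Both perturbation results can be checked on a two-tube model in which one short section is narrowed. The sketch below assumes the two-tube resonance condition $\sin\frac{\Omega\ell_1}{c}\sin\frac{\Omega\ell_2}{c} - \frac{A_2}{A_1}\cos\frac{\Omega\ell_1}{c}\cos\frac{\Omega\ell_2}{c} = 0$ (back tube closed at the glottis, front tube open at the lips) and finds its lowest root. The 17 cm tract, the 2 cm perturbed section, the halved area, and the function name are illustrative assumptions.

```python
import math


def first_resonance(a_back, l_back, a_front, l_front, c=34000.0):
    """Lowest root (Hz) of the two-tube resonance condition."""
    ratio = a_front / a_back

    def g(f):
        w = 2.0 * math.pi * f
        x1, x2 = w * l_back / c, w * l_front / c
        return (math.sin(x1) * math.sin(x2)
                - ratio * math.cos(x1) * math.cos(x2))

    f, step = 0.0, 1.0
    while g(f) * g(f + step) > 0.0:  # walk up until the sign changes
        f += step
    lo, hi = f, f + step
    for _ in range(60):  # bisection refinement
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


f1_uniform = first_resonance(1.0, 15.0, 1.0, 2.0)         # uniform 17 cm tube
f1_narrow_lips = first_resonance(1.0, 15.0, 0.5, 2.0)     # narrowed near the opening
f1_narrow_glottis = first_resonance(0.5, 2.0, 1.0, 15.0)  # narrowed near the closure
print(round(f1_uniform), round(f1_narrow_lips), round(f1_narrow_glottis))
```

For F1 the volume velocity is largest at the lips and the pressure is largest at the glottis, so the first narrowing lowers F1 below 500 Hz and the second raises it above 500 Hz.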

Summary of Perturbation Theory Results


[Figure: standing wave patterns (SWP) of $|U(x)|$ between glottis and lips for F1, F2, and F3, each paired with the sign (+ or −) of the change in that formant along the tube as a consequence of decreasing $A$; the positions $\frac{1}{2}\ell$, $\frac{2}{3}\ell$, $\frac{2}{5}\ell$, and $\frac{4}{5}\ell$ are marked]

Illustration of Perturbation Theory


Illustration of Perturbation Theory

The ship was torn apart on the sharp (reef)



Illustration of Perturbation Theory

(The ship was torn apart on the sh)arp reef



Multi-Tube Approximation of the Vocal Tract

We can represent the vocal tract as a concatenation of $N$ lossless
tubes with constant area $\{A_k\}$ and equal length $\Delta x = \ell/N$
The wave propagation time through each tube is $\tau = \dfrac{\Delta x}{c} = \dfrac{\ell}{Nc}$

[Figure: vocal tract area function approximated by concatenated uniform tube sections, with one section area labeled $A_7$]

Wave Equations for Individual Tube


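The wave equations for the $k$th tube have the form

$$p_k(x, t) = \frac{\rho c}{A_k}\left[U_k^+\!\left(t - \frac{x}{c}\right) + U_k^-\!\left(t + \frac{x}{c}\right)\right]$$

$$U_k(x, t) = U_k^+\!\left(t - \frac{x}{c}\right) - U_k^-\!\left(t + \frac{x}{c}\right)$$

where $x$ is measured from the left-hand side ($0 \le x \le \Delta x$)

[Figure: adjacent tubes of length $\Delta x$ with areas $A_k$ and $A_{k+1}$, showing the forward waves $U_k^+(t)$, $U_k^+(t - \tau)$, $U_{k+1}^+(t)$, $U_{k+1}^+(t - \tau)$ and the backward waves $U_k^-(t)$, $U_k^-(t + \tau)$, $U_{k+1}^-(t)$, $U_{k+1}^-(t + \tau)$]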

Update Expression at Tube Boundaries


We can solve update expressions using continuity constraints at
tube boundaries, e.g., $p_k(\Delta x, t) = p_{k+1}(0, t)$, and $U_k(\Delta x, t) = U_{k+1}(0, t)$

[Figure: signal flow graph of the junction between the $k$th and $(k+1)$st tubes, with DELAY elements of $\tau$ in each tube and junction gains $1 + r_k$, $1 - r_k$, $r_k$, and $-r_k$]

$$U_{k+1}^+(t) = (1 + r_k)\,U_k^+(t - \tau) + r_k\,U_{k+1}^-(t)$$

$$U_k^-(t + \tau) = -r_k\,U_k^+(t - \tau) + (1 - r_k)\,U_{k+1}^-(t)$$

$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}, \qquad \text{note } |r_k| \le 1$$
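
The junction update can be exercised directly. The sketch below, with illustrative function names and arbitrary example values, computes the reflection coefficients from an area function and applies one scattering step; it then confirms that the result satisfies the two continuity constraints (volume velocity, and pressure proportional to $(U^+ + U^-)/A$).

```python
def reflection_coefficients(areas):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for each pair of adjacent tubes."""
    return [(a_next - a) / (a_next + a) for a, a_next in zip(areas, areas[1:])]


def junction_update(u_plus_in, u_minus_in, r):
    """One scattering step at the junction between tube k and tube k+1.

    u_plus_in  is U_k^+(t - tau), the forward wave arriving from tube k.
    u_minus_in is U_{k+1}^-(t), the backward wave arriving from tube k+1.
    Returns (U_{k+1}^+(t), U_k^-(t + tau)).
    """
    u_plus_out = (1 + r) * u_plus_in + r * u_minus_in
    u_minus_out = -r * u_plus_in + (1 - r) * u_minus_in
    return u_plus_out, u_minus_out


areas = [1.0, 7.0]  # example areas in cm^2 (arbitrary)
r = reflection_coefficients(areas)[0]
u_plus_out, u_minus_out = junction_update(1.0, 0.5, r)

# Continuity of volume velocity and of pressure across the junction
flow_left = 1.0 - u_minus_out
flow_right = u_plus_out - 0.5
pressure_left = (1.0 + u_minus_out) / areas[0]
pressure_right = (u_plus_out + 0.5) / areas[1]
print(r, flow_left, flow_right, pressure_left, pressure_right)
```

Since the areas are positive, each $r_k$ lies strictly between −1 and 1; the limits correspond to a complete closure ($A_{k+1} \to 0$) and an opening into a very large area.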

Digital Model of Multi-Tube Vocal Tract


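Updates at tube boundaries occur synchronously every $2\tau$
If excitation is band-limited, inputs can be sampled every $T = 2\tau$
Each tube section has a delay of $z^{-1/2}$

[Figure: signal flow graph of one junction in the $z$ domain, with $z^{-1/2}$ delays on the forward path $U_k^+(z) \to U_{k+1}^+(z)$ and the backward path $U_{k+1}^-(z) \to U_k^-(z)$, and junction gains $1 + r_k$, $1 - r_k$, $r_k$, and $-r_k$]

The choice of $N$ depends on the sampling period $T$

$$T = 2\tau = \frac{2\ell}{Nc} \;\Longrightarrow\; N = \frac{2\ell}{cT}$$

Series and shunt losses can also be introduced at tube junctions
Bandwidths are proportional to energy loss to storage ratio
Stored energy is proportional to tube length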
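
A quick calculation shows how the number of sections follows from the sampling period. This is a minimal sketch of $N = 2\ell/(cT)$; the 17.5 cm tract length, the 10 kHz sampling rate, the speed of sound of 35,000 cm/sec, and the function name are assumed values for illustration.

```python
def number_of_sections(length_cm, sample_rate_hz, c_cm_per_s=35000.0):
    """N = 2 * l / (c * T), where T = 1 / sample_rate_hz is the sampling period."""
    return 2.0 * length_cm * sample_rate_hz / c_cm_per_s


n_sections = number_of_sections(17.5, 10000.0)
print(n_sections)  # 10.0
```

Doubling the sampling rate doubles the number of sections needed for the same tract length, because each section must then be half as long.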

Assignment 1


References
Zue, 6.345 Course Notes
Stevens, Acoustic Phonetics, MIT Press, 1998.
Rabiner & Schafer, Digital Processing of Speech Signals,
Prentice-Hall, 1978.
