Audio signal processing
KH WONG,
Rm 907, SHB, CSE Dept. CUHK,
Email: khwong@cse.cuhk.edu.hk
http://www.cse.cuhk.edu.hk/~khwong
Reference books
Overview (lecture 1)
Chapter 1.A : Introduction
Chapter 1.B : Signals in time & frequency domain
Chapter 2.A : Audio feature extraction techniques
Chapter 2.B : Recognition Procedures
Chapter 1:
Chapter 1.A : Introduction
Chapter 1.B : Signals in time & frequency domain
Chapter 1: introduction
Content
Components of a speech recognition system
Pre-processor
Feature extraction
Training of the system
Recognition
http://developer.android.com/reference/android/speech/SpeechRecognizer.html
https://chrome.google.com/webstore/detail/voicerecognition/ikjmfindklfaonkodbnidahohdfbdhkn?hl=en
Audio signal processing Ch1 , v.4b
Sampling example
16-bit quantization: the voltage or pressure range is mapped to digitized levels 0 -> (2^16 - 1) = 65535.
Sampling is at 1 KHz (time axis in ms).
Figure source: www.webkinesia.com/games/images/quant.gif
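The sampling example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a hypothetical unit-amplitude 100 Hz sine wave, and the function name `sample_and_quantize` and the linear mapping of the range [-1, 1] onto the levels 0 to 65535 are choices made for this sketch, not taken from the slides.

```python
import math

def sample_and_quantize(freq_hz, fs_hz, duration_s, bits=16):
    """Sample a unit-amplitude sine wave and map each sample to an integer level."""
    max_level = 2 ** bits - 1                      # 65535 for 16 bits
    n_samples = int(round(fs_hz * duration_s))     # number of data points
    levels = []
    for n in range(n_samples):
        t = n / fs_hz                              # sampling instant in seconds
        x = math.sin(2 * math.pi * freq_hz * t)    # analogue value in [-1, 1]
        # Assumed mapping: -1 -> level 0, +1 -> level 65535.
        levels.append(int(round((x + 1.0) / 2.0 * max_level)))
    return levels

# 10 ms of a 100 Hz tone sampled at 1 KHz gives 10 data points.
levels = sample_and_quantize(freq_hz=100, fs_hz=1000, duration_s=0.01)
print(levels)
```

After sampling, only these integer levels remain; the original waveform between the data points is not stored.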
After sampling you only have the data points.
You may reconstruct the signal by joining the data points.
The signal is sampled periodically (e.g. at 16 KHz) by an analogue-to-digital converter (ADC).
Each converted sample is a 16-bit value.
Tutorial: For a 16 KHz / 16-bit sampled signal, how many bytes are used in 1 second? (16000 samples x 2 bytes = 32 Kbytes)
If sampling is too slow, sampling may fail; see http://www.ras.ucalgary.ca/grad_project_2005/asph_sampling.jpg
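The tutorial arithmetic can be checked with a short Python sketch. The helper name `bytes_per_second` and the `channels` parameter are hypothetical, introduced only for this illustration; a single (mono) channel is assumed.

```python
def bytes_per_second(fs_hz, bits_per_sample, channels=1):
    """Storage needed for one second of uncompressed sampled audio."""
    bytes_per_sample = bits_per_sample // 8   # 16 bits -> 2 bytes
    return fs_hz * bytes_per_sample * channels

print(bytes_per_second(16000, 16))   # 32000 bytes, i.e. 32 Kbytes
```

The same function gives 88200 bytes per second for a 44100 Hz, 16-bit mono recording.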
A speech wave
Figure: a speech wave plotted against time samples. Sampling Frequency = FS = 44100 Hz (42070 samples); zoomed-in views show 1000 samples and 300 samples.
Answer:?
Speech recognition hardware
Figure: ADC (Analog to Digital Converter), Speech Recording System, DAC (Digital to Analog Converter)
Discussion: Conversion resolution
Music
Speech
Signal analysis
spectrum
Figure: pressure / output of mic against time.
Spectrogram (matlab function Specgram.m)
The spectrogram shows the energies of the frequency contents against time.
Basic Phonetics
Vowel
/AA/,/I/,/UH/
Diphthongs
/AY/,/AW/
Consonants
-Nasals /M/
-stops /B/,/P/
-fricative /V/,/S/
-whisper /H/
-affricates /JH/,/CH/
Phonetic table
http://www.telefonica.net/web2/eseducativa/phonetics/tablea.gif
Time framing
Frequency model
Fourier transform
Spectrogram
Source: http://antoniopo.files.wordpress.com/2011/03/eadweard_muybridge_horse.jpg?w=733&h=538
Time framing
Since our ears cannot respond to very fast changes in speech data content, we normally cut the speech data into frames before analysis (similar to watching fast-changing still pictures to perceive motion).
Frame size is 10~30 ms (1 ms = 10^-3 seconds).
Frames can overlap; the overlapping region normally ranges from 0 to 75% of the frame size.
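A minimal Python sketch of time framing follows. The function name `frame_signal`, the 20 ms frame size, the 50% overlap and the 16 KHz sampling rate are assumptions chosen for illustration (the slides only give the ranges 10~30 ms and 0 to 75%), and the input is dummy data rather than real speech.

```python
def frame_signal(samples, frame_len, overlap=0.5):
    """Cut samples into frames of frame_len samples; overlap is a fraction of the frame size."""
    if not 0.0 <= overlap <= 0.75:
        raise ValueError("overlap is expected to be between 0 and 0.75")
    # Distance between the starts of consecutive frames.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start + frame_len <= len(samples):   # drop an incomplete last frame
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

fs = 16000                              # assumed sampling rate in Hz
frame_len = int(round(0.020 * fs))      # 20 ms frame -> 320 samples
signal = list(range(fs))                # 1 second of dummy data
frames = frame_signal(signal, frame_len, overlap=0.5)
print(len(frames), len(frames[0]))
```

With 50% overlap each frame starts 160 samples after the previous one, so every sample (except near the ends) appears in two frames.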
Figure: the signal s_n against time, cut into windows of length N; l=1 is the first window (the figure also carries the label m).
http://en.wikipedia.org/wiki/Fast_Fourier_transform
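Fourier transform of a frame:
$$X_m = \sum_{k=0}^{N-1} S_k \, e^{-j\theta}, \quad \theta = \frac{2\pi k m}{N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}$$
$$e^{j\theta} = \cos(\theta) + j\sin(\theta), \quad j = \sqrt{-1}$$
Input (time domain): $S_k$, $k = 0, 1, 2, \ldots, N-1$, i.e. $S_0, S_1, S_2, \ldots, S_{N-1}$ (total N samples)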
Fourier Transform
$$X_m = \sum_{k=0}^{N-1} S_k \, e^{-j \frac{2\pi k m}{N}}, \quad \text{where } m = 0, 1, 2, 3, \ldots, \frac{N}{2}$$
$$|X_m| = \sqrt{\mathrm{real}^2 + \mathrm{imaginary}^2}$$
Figure: the Fourier Transform (FT) takes the time samples S0, S1, S2, S3, ..., SN-1 (signal voltage/pressure level against time k) to the magnitudes |Xm| against freq. (m). Labels in the figure: single freq., spectral envelope.
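As a small worked example of the magnitude formula, the Python sketch below evaluates the sum directly for a wave that contains a single frequency. The function name `dft_magnitude`, the frame length N = 64 and the choice of 5 cycles per frame are assumptions made for this illustration.

```python
import math

def dft_magnitude(samples):
    """Return |X_m| for m = 0..N/2, computed directly from the DFT sum."""
    n = len(samples)
    mags = []
    for m in range(n // 2 + 1):
        # Real and imaginary parts of X_m = sum_k S_k * exp(-j*2*pi*k*m/N).
        re = sum(s * math.cos(2 * math.pi * k * m / n) for k, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * m / n) for k, s in enumerate(samples))
        mags.append(math.sqrt(re * re + im * im))
    return mags

N = 64
cycles = 5   # hypothetical: the sine wave completes 5 cycles within the frame
wave = [math.sin(2 * math.pi * cycles * k / N) for k in range(N)]
mags = dft_magnitude(wave)
peak = max(range(len(mags)), key=lambda m: mags[m])
print(peak, round(mags[peak], 3))
```

The magnitudes are close to zero everywhere except at m = 5, which is why a single-frequency wave shows up as one line in the |Xm| plot.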
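Power spectrum (Fourier Transform of a frame)
The envelope is a plot of the energy vs. frequency.
Figure: a time-domain frame passes through the DFT or FFT to give the frequency domain output (energy against freq., with 1 KHz and 2 KHz marked); the spectral envelope shows the first formant and the second formant.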
$$X_m = \sum_{k=0}^{N-1} S_k \, e^{-j\theta}, \quad \theta = \frac{2\pi k m}{N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}$$
$$e^{j\theta} = \cos(\theta) + j\sin(\theta), \quad j = \sqrt{-1}$$
How to generate a spectrogram?
A specgram
For frame q=1, calculate X_magnitude(m) = |X(m)| (the figure labels the values |X(i)| up to |X(128)|).
Vertically,
-m is the vertical axis
-|X(m)| = X_magnitude(m) is represented by intensity
Repeat for frame q=2, ..., frame q=Q.
Calculate the
Figure labels: frame q=1, frame q=2, frame q=3, ..., frame q=7
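The spectrogram procedure (one magnitude spectrum per frame, repeated for q = 1..Q) can be sketched in Python as below. This is an illustration only: the names `magnitude_spectrum` and `spectrogram`, the frame length of 32 samples, the non-overlapping frames and the two-tone test signal are assumptions, and no window function is applied.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """|X(m)| for m = 0..N/2 of one frame, from the DFT sum."""
    n = len(frame)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * m / n)
                    for k, s in enumerate(frame)))
            for m in range(n // 2 + 1)]

def spectrogram(samples, frame_len, hop):
    """Return spec[q][m]: the intensity at frequency index m for frame q (0-based)."""
    spec = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        spec.append(magnitude_spectrum(samples[start:start + frame_len]))
    return spec

# Hypothetical test signal: two frames with 4 cycles per frame, then two with 10.
N = 32
low = [math.sin(2 * math.pi * 4 * k / N) for k in range(2 * N)]
high = [math.sin(2 * math.pi * 10 * k / N) for k in range(2 * N)]
spec = spectrogram(low + high, frame_len=N, hop=N)
peaks = [max(range(len(col)), key=lambda m: col[m]) for col in spec]
for q, peak in enumerate(peaks, start=1):
    print("frame q =", q, "strongest frequency index m =", peak)
```

Plotting each `spec[q]` as one vertical column, with m on the vertical axis and the magnitude shown as intensity, gives the spectrogram picture; here the bright band would jump from m = 4 to m = 10 halfway through.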
Sound file is tz1.wav: the high energy bands are the formants (horizontal axis: seconds).
http://www.cse.cuhk.edu.hk/%7Ekhwong/www2/cmsc5707/tz1.wav
http://www.cse.cuhk.edu.hk/~khwong/www2/cmsc5707/trumpet.wav
http://www.cse.cuhk.edu.hk/%7Ekhwong/www2/cmsc5707/violin3.wav
Spectrogram of Violin3.wav: the violin has a complex spectrum; the high energy bands are labelled as formants (horizontal axis: seconds).
Exercise 1.7
Summary
Studied
Appendix
Figure: the signal s_n against time, cut into windows of length N; l=1 is the first window.
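Answer: Class exercise 1.5: Fourier Transform
$$X_m = \sum_{k=0}^{N-1} S_k \, e^{-j\theta}, \quad \theta = \frac{2\pi k m}{N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}$$
$$e^{j\theta} = \cos(\theta) + j\sin(\theta)$$
http://en.wikipedia.org/wiki/List_of_trigonometric_identitie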
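/* DFT of N real samples S[0..N-1]; outputs X_real[m], X_img[m] for m = 0..N/2 */
for (m = 0; m <= N/2; m++)
{
    tmp_real = 0; tmp_img = 0;
    for (k = 0; k < N; k++)   /* k must cover all N samples, 0..N-1 */
    {
        tmp_real = tmp_real + S[k]*cos(2*pi*k*m/N);
        tmp_img  = tmp_img  - S[k]*sin(2*pi*k*m/N);   /* minus sign from e^(-j*theta) */
    }
    X_real[m] = tmp_real;
    X_img[m]  = tmp_img;
}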
From N input data Sk, k=0,1,2,3..N-1, there will be 2*(N/2+1) = N+2 data generated, i.e.
X_real(m), X_img(m), m=0,1,2,3..N/2 are generated.
E.g. for Sk=S0,S1,..,S511 (N=512), the 514 outputs are
X_real0,X_real1,..,X_real256,
X_img0,X_img1,..,X_img256.
Note that X_magnitude(m) = sqrt[X_real(m)^2 + X_img(m)^2]
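The output count can be checked with a short Python sketch. It also shows one reason (stated here as background, not taken from the slides) why m only needs to run up to N/2 for real input: the magnitudes for m above N/2 mirror those below it. The function name `dft` and the eight sample values are arbitrary choices for this illustration.

```python
import cmath
import math

def dft(samples):
    """X_m = sum_k S_k * exp(-j*2*pi*k*m/N), evaluated for every m = 0..N-1."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * math.pi * k * m / n)
                for k, s in enumerate(samples))
            for m in range(n)]

samples = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]   # N = 8 arbitrary real values
N = len(samples)
X = dft(samples)

half = X[: N // 2 + 1]        # m = 0..N/2, the range used in the answer above
n_values = 2 * len(half)      # one real and one imaginary part per m
# For real input, |X_(N-m)| equals |X_m|, so m > N/2 adds no new magnitudes.
mirror_ok = all(abs(abs(X[N - m]) - abs(X[m])) < 1e-9 for m in range(1, N // 2))
print(n_values, N + 2, mirror_ok)
```

For N = 512 the same count gives 2 * 257 = 514 values, matching X_real0..X_real256 and X_img0..X_img256.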
Calculate the
$$X_m = \sum_{k=0}^{N-1} S_k \, e^{-j \frac{2\pi k m}{N}}, \quad m = 0, 1, 2, 3, \ldots$$
The reason is this:
In theory, m can be any number from -infinity to +infinity (the original Fourier transform definition). In practice it runs from 0 to N-1, because outside 0 to N-1 there are no numbers to work on.