Institut für Automatisierungstechnik
Fachgebiet Regelungstheorie und Robotik
Prof. Dr.-Ing. Jürgen Adamy
Landgraf-Georg-Str. 4
D-64283 Darmstadt
Diplomarbeit
Study of blind dereverberation algorithms for real-time applications
Xavier Domont
Work in cooperation with:
Honda Research Institute Europe GmbH
D-63073 Offenbach/Main
Tutors:
Dr.-Ing. Martin Heckmann (HRI)
Dipl.-Ing. Bjoern Scholing (TUD)
June 2005
Abstract
At Honda Research Institute Europe, an automatic speech recognition system is being developed for the humanoid robot ASIMO. Reverberation alters the perception of speech signals emitted in a room and reduces the performance of automatic speech recognition. Many methods have been proposed in the past few decades to enhance reverberant speech signals. This diploma thesis studies the most promising algorithms and discusses whether they can be implemented in real time for real environments.
The existing methods can be classified into two families:
1. Those that estimate the clean speech signal directly and treat reverberation as a disturbance.
2. Those that estimate the room impulse response and invert the estimated system to recover the clean speech.
These two approaches are compared in this thesis, based on Matlab implementations of selected algorithms. The comparison focuses on the suitability of these algorithms for real environments, where speaker and robot are moving, and on a possible real-time implementation.
Kurzfassung
At Honda Research Institute Europe, an automatic speech recognition system is being developed for the robot ASIMO. Reverberation degrades speech quality and markedly lowers speech recognition results. Over the past 30 years, many methods have been proposed to enhance speech signals. This diploma thesis examines the most promising algorithms with respect to real-time capability and applicability under real conditions.
There are two approaches to solving this problem:
1. The original speech signal can be estimated directly from the observed signal; the reverberation is treated as a disturbance of the clean signal.
2. The room impulse response can be determined and then inverted in order to recover the original speech signal.
These two approaches are compared in this diploma thesis. For this purpose, selected algorithms were implemented. The main focus of the comparison was the applicability of the methods in real environments in which speaker and robot move.
Contents

1 Introduction
2 Model of a reverberant signal
  2.2 Room acoustics
3 Enhancement of a speech signal
  3.1.3 Dereverberation operator
  3.2.1 Problem formulation
4.1.1 Hypothesis
4.1.2 Basic idea
4.2 Batch method
4.2.2 Noisy case
4.3 Iterative method
4.3.2 Simulation
5.1.1 Harmonicity-based dereverberation
5.1.3 Channel estimation
A Proofs
List of Figures

2.6  Measurement method
2.12 Spectrograms of an anechoic signal (left) and the spectrogram resulting from its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filterbank.
3.4  Top left: original signal (sweep with harmonics). Top right: reverberant signal. Bottom left: harmonic estimate with the Gammatone filterbank. Bottom right: harmonic estimate with Nakatani's harmonic filter.
3.12 (a) A single-channel time-domain adaptive algorithm for maximizing the kurtosis of the LP residuals. (b) Equivalent system, which avoids LP reconstruction artifacts.
3.13 Two-channel frequency-domain adaptive algorithm for maximization of the kurtosis of the LP residual.
3.14 On the left, the LP residual of a clean signal; on the right, the LP residual of the resulting dereverberated signal. The kurtosis of the dereverberated signal is higher than that of the original signal, and the resulting signal is strongly distorted.
4.1  SIMO system
4.3  Estimated zeros and real zeros for one channel (left) and zeros of all the estimated channels (right). On the left, 4 estimated zeros stand alone; they do not correspond to a true zero of the filter. On the right, it can be noticed that these 4 additional zeros are common to all the estimated channels.
–    Eigenvalues of the correlation matrix in the noisy case. The variance of the noise is equal to 10^-10 on the left and 10^-6 on the right.
–    Iterative estimation of the channel impulse responses using 5 microphones. On the left, the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right, the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac impulse (red).
Acronyms
FFT   Fast Fourier Transform
DFT   Discrete Fourier Transform
STFT  Short-Time Fourier Transform
SISO  Single-Input Single-Output
SIMO  Single-Input Multiple-Output
MIMO  Multiple-Input Multiple-Output
ROC   Region of Convergence
LTI   Linear Time-Invariant
FIR   Finite Impulse Response
IIR   Infinite Impulse Response
MINT  Multiple-input/output INverse Theorem
Chapter 1
Introduction
1.1
The acoustic signals emitted in a room reflect off the walls and other objects (see figure 1.1). The direct signal and all the reflected sound waves arrive at the microphone or listener with different delays and sum up. This effect is called reverberation. Sometimes the term echo is used instead of reverberation; however, an echo generally implies a distinct, delayed version of a sound. In a room, each delayed sound wave arrives within such a short period of time that we do not perceive each reflection as a copy of the original sound. Even though we cannot discern
every reflection, we still hear the effect of the entire series of reflections.
Whereas a human being without hearing problems can cope quite well with these distortions, reverberation impairs speech intelligibility in devices such as hands-free conference telephones and automatic speech recognition systems.
The diagram in figure 1.2 shows how the system can be modeled. The effect of
the room is considered as a filter with impulse response h(t) whose input is the
clean speech signal s(t) and the output is the observed reverberant signal x(t).
Figure 1.2: General model of a reverberant signal
Figure 1.3 shows the general shape of a room impulse response. Reverberation corrupts the speech by blurring its temporal structure. Due to the spectral continuity of speech, the early reflections mainly increase the intensity of the reverberant speech, whereas the later ones are deleterious to speech quality and intelligibility.
The aim of blind dereverberation is to recover the clean signal s(t) from the observed reverberant signal x(t). The term blind means that neither the clean signal nor the impulse response of the room is known before the processing.
1.2
This diploma thesis was written in cooperation with Honda Research Institute (HRI) Europe. One of the important projects of HRI is the development of the humanoid robot ASIMO.
1.3
The audio processing system at HRI uses a Gammatone filterbank. This type of filterbank is widely used in audio signal processing, as it simulates the human auditory system.
1.3.1
The aim of the peripheral auditory system (see figure 1.4) is to transform a sound (which is actually a pressure variation in the air) into nerve impulses. These impulses are then conveyed by the auditory nerve to the brain stem. The nerve cells in the brain stem act as relay stations, eventually conveying nerve impulses to the auditory cortex.
The outer ear is composed of the pinna (the visible part) and the auditory canal, or meatus. The pinna significantly modifies the incoming sound in a way that depends on the angle of incidence of the sound relative to the head. This is important for sound localization. Sound travels down the meatus and causes the eardrum, or tympanic membrane, to vibrate. These vibrations are transmitted through the middle ear by three small bones, the ossicles, to a membrane-covered opening in the bony wall of the spiral-shaped structure of the inner ear, the cochlea.
The cochlea is shaped like the spiral shell of a snail. It is filled with almost incompressible fluids and is divided along its length by two membranes, Reissner's membrane and the basilar membrane. The motion of the basilar membrane in response to a sound is of primary interest.
1.3.2
The impulse response of a Gammatone filter is given by

g(t) = A t^(n-1) e^(-2π b t) cos(2π f0 t + φ),   t ≥ 0,   (1.1)

where t is the time, b determines the duration of the impulse response, n is the order of the filter and determines the slope of the skirts of the filter, φ is a phase and f0 is the center frequency.
It can be observed from figure 1.5 that the Gammatone filter is a bandpass filter with its center frequency at f0. Its bandwidth depends on b.
To simulate the whole basilar membrane, a bank of Gammatone filters can be
used. Each filter channel represents the frequency response of one point on the
basilar membrane.
The parameters of the Gammatone filters are determined out of psychoacoustic
measurements. Glasberg and Moore [2] summarized the equivalent rectangular
bandwidth (ERB) of the human auditory filter. The ERB of a filter is defined as
the width of a rectangular filter whose height equals the peak gain of the filter
and which passes the same total power as the filter.
The relation between the bandwidth and the center frequency of the Gammatone
filters is given by:
ERB = 24.7 + 0.108 f0 .
(1.2)
Figure 1.6 shows the transfer functions of a bank of 16 filters with center frequencies spaced between 50 Hz and 8 kHz. As the spectral resolution of the basilar membrane decreases with increasing frequency, the center frequencies of the Gammatone filters are not linearly distributed, and their bandwidths increase with the center frequency according to equation (1.2). We can also note that the pass bands overlap.
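The placement of the channels can be made concrete with a small sketch. The following code (an illustration, not part of the original Matlab implementation; the choice of 16 channels between 50 Hz and 8 kHz follows figure 1.6) computes the ERB of equation (1.2) and distributes the center frequencies uniformly on the ERB-rate scale, which yields the nonlinear spacing described above:

```python
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centered at f0 (Hz), after Glasberg and Moore (equation (1.2))."""
    return 24.7 + 0.108 * f0

def erb_scale(f):
    """Number of ERBs below frequency f (integral of 1/erb(f))."""
    return (1.0 / 0.108) * np.log(1.0 + 0.108 * f / 24.7)

def inverse_erb_scale(e):
    """Frequency corresponding to a value on the ERB-rate scale."""
    return (24.7 / 0.108) * (np.exp(0.108 * e) - 1.0)

def center_frequencies(n_channels=16, f_low=50.0, f_high=8000.0):
    """Center frequencies uniformly spaced on the ERB-rate scale."""
    e = np.linspace(erb_scale(f_low), erb_scale(f_high), n_channels)
    return inverse_erb_scale(e)

fc = center_frequencies()
```

Because the spacing is uniform on the ERB-rate scale, the channels are dense at low frequencies and sparse at high frequencies, and the bandwidth erb(fc) grows with the center frequency, as in equation (1.2).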
1.4
In chapter 3, two methods that use the properties of speech to enhance reverberant signals will be studied. These methods consider the room effect as a disturbance and try to restore the characteristics of the speech that the reverberation altered.
In chapter 4, the possibility of estimating the room impulse responses will be discussed. This approach is very interesting since, once the effect of the room on the signal is known, it becomes possible to invert this effect and recover the clean signal.
Chapter 2
Model of a reverberant signal
In terms of signal processing, a room can be seen as a filter. The original (anechoic) signal s(n) goes through a filter h(n) and gives the reverberant signal x(n); see figure 2.1. In the case of blind dereverberation, both the input signal s(n) and the room transfer function h(n) are unknown.
Figure 2.1: General model of a reverberant signal
The task of dereverberation is to find an estimate ŝ(n) of s(n), given the output x(n) of the system. In order to make this task feasible, a model of the speech signal and/or a model of the room is required.
In section 2.1, different ways to model a speech signal will be discussed. In section 2.2, the effects of the room on the speech signal will be investigated. Finally, section 2.3 will discuss the possibility of inverting the effects of the room.
2.1
2.1.1
The principal components of the human speech production system (see figure 2.2) are the lungs, trachea (windpipe), larynx (organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). The pharyngeal and oral cavities are usually grouped together and referred to as the vocal tract.

Figure 2.2: Schematic diagram of the human speech production mechanism (source: [3])
It is useful to think of speech production in terms of an acoustic filtering operation. The pharyngeal, oral and nasal cavities comprise the main acoustic filter.
This filter is excited by the organs below it, and is loaded at its main output by
a radiation impedance due to the lips. The articulators are used to change the
properties of the system, its form of excitation, and its output loading over time.
Figure 2.3 shows a simplified acoustic model illustrating these ideas.
Figure 2.3: Block diagram of the human speech production (source: [3])
2.1.2
The spectral characteristics of the speech wave are non-stationary, since the physical system changes rapidly over time. Speech can therefore be divided into sound segments which present similar properties over a short period of time. Without going into further detail, the main way to classify a speech sound is by the type of excitation.
The two elementary types of excitation are voiced and unvoiced. There are actually a few other types of excitation (mixed, plosive, whisper, silence), but they can be seen as combinations of the two elementary types.
Voiced sounds are produced by forcing air through the glottis, an opening between the vocal folds. The vocal cords vibrate in an oscillatory fashion and, therefore, the produced speech signal is quasi-periodic; its period is called the fundamental period T0, and the fundamental frequency is defined as F0 = 1/T0.
Unvoiced sounds are generated by forming a constriction at some point along the vocal tract and forcing air through the constriction to produce turbulence. The produced speech signal is a noise-like sound.
Typical human speech communication is limited to a bandwidth of 7-8 kHz. The main part of the energy is contained in voiced segments.
2.1.3
A speech signal s(n) can be modeled [4] as the sum of a harmonic signal sh(n), derived from the glottal vibration, and a non-harmonic signal sn(n), such as fricatives and plosives:

s(n) = sh(n) + sn(n).   (2.1)

The harmonic part of the signal is defined by its voiced durations and their fundamental frequencies (F0). A voiced duration is the time during which the vocal cords vibrate to generate a harmonic signal, and the fundamental frequency refers to the frequency of the fundamental component of the signal. Each harmonic component has a frequency which corresponds to F0 or one of its multiples.
It can be assumed that F0 is constant within a short time. The harmonic signal sh(n) can therefore be modeled over a time frame of length T by a sum of sinusoidal components whose frequencies coincide with the fundamental frequency of the signal and its multiples:

sh(n) = Σ_{k=1}^{N} A_k cos( 2π k F0 (n − nc)/fs + φ_k )   for |n − nc| < T/2,   (2.2)

where A_k and φ_k are the amplitude and the phase of the k-th harmonic component, nc is the time index of the center of the frame and fs is the sampling rate.
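As an illustration of this harmonicity model, the following sketch synthesizes one frame of a harmonic signal according to equation (2.2). The fundamental frequency, number of harmonics and amplitudes are hypothetical values chosen for the example, not taken from the thesis:

```python
import numpy as np

def harmonic_frame(f0, amps, phases, fs, T, nc):
    """One frame of the harmonic model of equation (2.2):
    a sum of sinusoids at multiples of the fundamental frequency f0,
    evaluated for |n - nc| < T/2."""
    n = np.arange(nc - T // 2, nc + T // 2)
    sh = np.zeros(len(n))
    for k, (A, phi) in enumerate(zip(amps, phases), start=1):
        sh += A * np.cos(2 * np.pi * k * f0 * (n - nc) / fs + phi)
    return sh

fs = 16000                      # sampling rate (Hz)
f0 = 120.0                      # assumed fundamental frequency (Hz)
amps = [1.0, 0.5, 0.25, 0.125]  # amplitudes A_k (illustrative)
phases = [0.0, 0.0, 0.0, 0.0]   # phases phi_k
frame = harmonic_frame(f0, amps, phases, fs, T=512, nc=256)
```

The spectrum of such a frame shows peaks at F0 and its multiples, which is exactly the structure the harmonic filters of chapter 3 try to exploit.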
2.1.4
A widely used model of speech signals is given by Linear Prediction (LP) analysis. This model consists of separating the speech signal into an excitation signal and a model of the vocal tract.
The production model (see figure 2.4) consists of an excitation e(n) driving a vocal tract filter. The transfer function of the vocal tract is a pole-zero filter

H(z) = ( Σ_{i=1}^{L} b(i) z^{-i} ) / ( 1 + Σ_{i=1}^{R} a(i) z^{-i} ).   (2.3)

The excitation is

e(n) = Σ_{q=-∞}^{+∞} δ(n − qP)   in the voiced case,
e(n) = uncorrelated noise        in the unvoiced case,   (2.4)

where P is the pitch period and δ(n) is the unit impulse

δ(n) = 1 if n = 0, 0 else.   (2.5)

LP analysis estimates this system with an all-pole filter (actually the LP model has a minimum-phase characteristic; this notion will be discussed in more detail in section 2.3):

H̃(z) = 1 / ( 1 − Σ_{i=1}^{R} ã(i) z^{-i} ).   (2.6)

Figure 2.4: Discrete-time speech production model. (a) True model. (b) Model to be estimated using LP analysis. (source: [3])

In the time domain, each sample of the speech signal is predicted from its past samples:

s(n) = Σ_{k=1}^{L} a_k s(n − k) + e(n),   (2.7)
where a_k are the LP coefficients. The excitation signal e(n) can be seen, in terms of system identification, as the prediction error signal, also called the LP residual.
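The LP coefficients and the residual of equation (2.7) can be computed by least squares. The following is a minimal sketch on a synthetic signal, not the thesis's Matlab implementation; the AR(2) test signal and its coefficients are chosen purely for illustration:

```python
import numpy as np

def lp_analysis(s, order):
    """Estimate the LP coefficients a_k by least squares and return the
    prediction error e(n) = s(n) - sum_k a_k s(n-k), the LP residual."""
    N = len(s)
    # Row n of X contains the past samples s(n-1), ..., s(n-order)
    X = np.column_stack([s[order - k:N - k] for k in range(1, order + 1)])
    y = s[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ a
    return a, residual

# Synthetic AR(2) "speech-like" signal driven by white noise
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]

a, res = lp_analysis(s, order=2)
```

Since the test signal really is autoregressive, the estimated coefficients come out close to (1.3, -0.6) and the residual recovers the driving noise, which is what "whitening" by LP analysis means.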
2.2 Room acoustics
This section first presents how room impulse responses can be measured (2.2.1) or simulated (2.2.2). The goal is to obtain a set of real and artificial impulse responses. These data will be useful in the following chapters to test the dereverberation methods.
Then (2.2.3) we will discuss whether a general model of a room can be found. Finally (2.2.4), we will use time-frequency analysis to briefly study the effects of reverberation on speech signals.
2.2.1
In order to obtain real impulse responses corresponding to a normal room, we performed measurements in the office of the CARL group at HRI. A sound signal was played into the room through a loudspeaker. Simultaneously, the sound wave was recorded using a model of ASIMO's head equipped with two microphones.
For each microphone, both the input s(n) and the output x(n) of the system are known; the impulse response h(n) can then be computed by inverting the convolution

x(n) = h(n) ∗ s(n).   (2.8)

However, the measurement is generally altered by additive noise. To improve the measurement it is therefore better to use auto- and cross-correlation functions. Equation (2.8) becomes

Rsx(n) = h(n) ∗ Rs(n),   (2.9)

where Rs(n) is the autocorrelation function of s(n) and Rsx(n) the cross-correlation function of s(n) and x(n). Equation (2.9) is less sensitive to noise. Moreover, if s(n) is white noise, its autocorrelation function is equal to δ(n). Then

h(n) = Rsx(n),   when Rs(n) = δ(n).   (2.10)

The impulse response of the room is thus equal to the cross-correlation function of the white noise played by the loudspeaker and the signal recorded by the microphone. For our measurement we used 1 second of Gaussian white noise as the room excitation signal.
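The cross-correlation identity (2.10) is easy to verify in simulation. The sketch below uses a short synthetic impulse response as a stand-in for the real room (the decay constant and the 16 kHz rate are illustrative assumptions, not the measurement conditions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "room": a short decaying impulse response, stand-in for h(n)
h = np.exp(-0.05 * np.arange(100)) * rng.standard_normal(100)
h[0] = 1.0  # direct path

# 1 second of Gaussian white-noise excitation, as in the measurement
fs = 16000
s = rng.standard_normal(fs)
x = np.convolve(s, h)  # observed signal x(n) = h(n) * s(n)

# Cross-correlation estimate of equation (2.10):
# Rsx(n) ~ (1/N) sum_m s(m) x(m + n)
N = len(s)
h_est = np.array([np.dot(s, x[n:n + N]) for n in range(len(h))]) / N
```

With only one second of noise the estimate is not exact; the residual error shrinks as 1/sqrt(N), which is why longer excitation signals give cleaner measured impulse responses.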
Two different sound cards were used to play and record the signals. In order to easily synchronize the input and the output of the system, the excitation signal was directly recorded on another channel of the capture sound card in addition to the signals of the microphones (see figure 2.6).

Figure 2.6: Measurement method

Moreover, this method makes it possible to compensate for possible effects of the sound cards. As only a stereo sound card was available, the recordings for the left and right ears had to be performed separately. Figure 2.7 shows one of the measured impulse responses.
2.2.2
The image method simulates a room impulse response by replacing reflections with virtual sources. A sound wave emitted by the source is reflected off a wall and then impinges upon the microphone. This reverberated sound seems to come directly from a virtual source located in an adjacent room, the mirror image of the original room with respect to the wall (see figure 2.9). On this figure the black line represents the real path of the signal, whereas the blue line is its perceived path.
This process can be extended to sound waves that are reflected more than once off the walls (see figure 2.10), and can be continued in the same way for higher-order reflections.

Figure 2.10: Image method: sound wave reflecting off two walls
The virtual sources make it easy to compute the distance the sound wave travels to arrive at the microphone. Considering a rectangular room with dimensions (Lx, Ly, Lz), the coordinate vector r_{i,j,k} = (x_i, y_j, z_k)^T, (i, j, k) ∈ Z³, of a virtual source is

x_i = (−1)^i x_source + ( i + (1 − (−1)^i)/2 ) Lx
y_j = (−1)^j y_source + ( j + (1 − (−1)^j)/2 ) Ly   (2.11)
z_k = (−1)^k z_source + ( k + (1 − (−1)^k)/2 ) Lz

where r_source = (x_source, y_source, z_source)^T is the coordinate vector of the source. The distance from the virtual source to the microphone at r_m is

d_{i,j,k} = || r_{i,j,k} − r_m || = sqrt( (x_i − x_m)² + (y_j − y_m)² + (z_k − z_m)² ).   (2.12)

The amplitude of the corresponding impulse combines the geometrical attenuation

b_{i,j,k} = 1 / (4π d_{i,j,k}²)   (2.15)

with the attenuation due to the wall reflections

c_{i,j,k} = β^(|i| + |j| + |k|),   (2.16)

where β < 1 is the wall reflection coefficient (which is, in this simple model, considered to be the same for all the walls):

h_{i,j,k} = b_{i,j,k} c_{i,j,k}.   (2.17)
Although the impulse response of the room should contain an infinite number of delayed impulses, corresponding to an infinity of virtual sources, the magnitudes h_{i,j,k} become very small for large |i|, |j| and |k|. The impulse response then has a finite time duration:

h(t) = Σ_{i=−n}^{n} Σ_{j=−n}^{n} Σ_{k=−n}^{n} h_{i,j,k} δ( t − d_{i,j,k}/c ),   (2.18)

where c is the speed of sound.

Figure 2.11: Room impulse response simulated with the image method

Figure 2.11 shows a simulated room impulse response obtained with the image method. Reverberant sounds generated using such an impulse response sound like signals recorded in real conditions. However, phenomena like the phase inversion of the sound wave when it reflects off a wall, or the presence of objects or people, are ignored by this model.
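The image method described by equations (2.11)-(2.18) can be sketched in a few lines. The room dimensions, source and microphone positions, reflection coefficient and truncation order below are illustrative assumptions, not the values used in the thesis:

```python
import numpy as np

def image_method(room, src, mic, beta=0.9, order=8, fs=16000, c=343.0):
    """Simulate a room impulse response with the image method,
    following equations (2.11)-(2.18). Room dimensions in meters;
    beta is the wall reflection coefficient (same for all walls)."""
    Lx, Ly, Lz = room
    L_ir = int(0.5 * fs)  # 0.5 s impulse response
    h = np.zeros(L_ir)
    idx = range(-order, order + 1)
    for i in idx:
        for j in idx:
            for k in idx:
                # Virtual-source coordinates, equation (2.11)
                x = (-1)**i * src[0] + (i + (1 - (-1)**i) / 2) * Lx
                y = (-1)**j * src[1] + (j + (1 - (-1)**j) / 2) * Ly
                z = (-1)**k * src[2] + (k + (1 - (-1)**k) / 2) * Lz
                # Distance to the microphone, equation (2.12)
                d = np.sqrt((x - mic[0])**2 + (y - mic[1])**2 + (z - mic[2])**2)
                # Geometrical spreading times wall attenuation, (2.15)-(2.17)
                amp = beta**(abs(i) + abs(j) + abs(k)) / (4 * np.pi * d**2)
                n = int(round(fs * d / c))  # delay in samples
                if n < L_ir:
                    h[n] += amp
    return h

h = image_method(room=(5.0, 4.0, 3.0), src=(1.0, 1.0, 1.5), mic=(3.5, 2.0, 1.5))
```

The first nonzero tap corresponds to the direct path, and the density of later taps grows while their amplitudes decay, reproducing the exponential tail of figure 1.3.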
2.2.3
The general shape of the measured and simulated room impulse responses corresponds to the one described in figure 1.3. However, when the conditions in the room change (movement of the talker and/or listener), the coefficients of the impulse response fluctuate strongly, especially in the late reverberation tail. As explained in chapter 1, the distortions in the speech signal are mostly due to the late reverberation. Therefore, a model based on the image method, where the room impulse responses are modeled by a sum of delayed impulses h_{i,j,k} δ(t − τ_{i,j,k}), is not practical for system identification.
Actually, the only general properties which can be retained from a room impulse response are its linearity, its causality (there is no reverberation before the beginning of the signal) and its general exponential decay structure.
In real environments the talker or the listener may be moving, so the effect of the room is time-variant. However, if we assume that the computation is fast enough, the system can be considered Linear Time-Invariant (LTI).
Moreover, because of the exponential decay, the impulse response of the room has a finite duration. The room is then modeled by a Finite Impulse Response (FIR) filter. The relation between the input s(n) and the output x(n) is given by the convolution

x(n) = h(n) ∗ s(n) = Σ_{k=0}^{L−1} h(k) s(n − k),   (2.19)

where L is the length of the impulse response (also called the order of the channel).
Actually, the FIR model of the room impulse response is very practical, as the transfer function of the system, i.e. the z-transform of its impulse response,

H(z) = Σ_{k=0}^{L−1} h(k) z^{−k},   (2.20)

is a polynomial.
2.2.4

Figure 2.12: Spectrograms of an anechoic signal (left) and the resulting spectrogram of its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filterbank.

In the time domain, reverberation is characterized by the reverberation time of the room, i.e. the time for the sound to die away to a level of 60 dB below its original level. In the frequency domain, the reverberation slightly affects the adjacent channels. According to [7], this effect has the form of a Laplace distribution.
2.3
2.3.1
A dereverberation filter g(n) has to recover the clean signal from the observed one:

g(n) ∗ x(n) = g(n) ∗ h(n) ∗ s(n) = s(n),   (2.21)

which can be simplified to

g(n) ∗ h(n) = δ(n).   (2.22)

The inversion problem can be studied with the help of the z-transform. The z-transform of h(n), also called the transfer function of the filter, is defined as the power series

H(z) = Σ_{k=−∞}^{+∞} h(k) z^{−k}.   (2.23)

It was shown in section 2.2 that the room can be considered as an FIR filter. Then its z-transform is a polynomial

H(z) = h_0 + h_1 z^{−1} + ... + h_{L−1} z^{−L+1}   (2.24)

and the transfer function of the inverse filter is

G(z) = 1 / H(z)   (2.25)
     = 1 / ( h_0 + h_1 z^{−1} + ... + h_{L−1} z^{−L+1} ).   (2.26)

The Infinite Impulse Response (IIR) filter G(z) is causal and stable if and only if all its poles are inside the unit circle (|z| = 1). As the poles of G(z) are the zeros of H(z), this means that all the zeros of H(z) must be inside the unit circle. Such a system is called minimum-phase.
In order to understand this problem, we can observe what happens if we want to invert a simple non-minimum-phase system. Given the FIR filter h(n), defined in the time domain by

h(n) = δ(n) − 2δ(n − 1),

its transfer function is

H(z) = 1 − 2z^{−1}.

The Region of Convergence (ROC) of this z-transform is |z| > 0. As this system has a zero at z = 2, it is non-minimum-phase. The transfer function of its inverse system is

G(z) = 1 / (1 − 2z^{−1}) = z / (z − 2).

G(z) has a zero at the origin and a pole at z = 2. In this case there are two possible regions of convergence and hence two possible inverse systems. If the ROC of G(z) is taken as |z| > 2, then

g(n) = 2^n u(n),

where u(n) is the unit step function

u(n) = 1 if n ≥ 0, 0 else.   (2.27)

This is the impulse response of a causal and unstable system. On the other hand, if the ROC is assumed to be |z| < 2, the impulse response of the inverse system is

g(n) = −2^n u(−n − 1).

In this case the inverse system is anticausal and stable.
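The anticausal inverse can be checked numerically. The sketch below truncates the anticausal impulse response to N = 30 samples (an arbitrary choice) and convolves it with h(n) = δ(n) − 2δ(n−1); the result is a unit impulse up to a tiny truncation error at the earliest sample:

```python
import numpy as np

h = np.array([1.0, -2.0])  # h(n) = delta(n) - 2 delta(n-1)

# Anticausal inverse g(n) = -2^n u(-n-1), truncated to n = -N, ..., -1
N = 30
n = np.arange(-N, 0)
g = -np.power(2.0, n.astype(float))

# The convolution covers n = -N, ..., 0; the last sample is n = 0
y = np.convolve(g, h)
```

Every interior sample cancels exactly (−2^n + 2·2^{n−1} = 0), the sample at n = 0 equals 1, and the only error is the leading tap −2^{−N}, which illustrates why the stable inverse of a non-minimum-phase system must act on future samples.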
2.3.2
Any transfer function can be decomposed into the cascade of a minimum-phase system and an allpass system:

H(z) = Hmin(z) Hap(z).   (2.28)

An allpass filter has a constant magnitude response; each of its zeros z0 = r e^{jθ} is paired with a pole at (1/r) e^{jθ} (see figure 2.14).

Figure 2.14: Pole (×) and zero (○) of an allpass filter

Among all sequences with the same magnitude spectrum, the minimum-phase sequence hmin(n) concentrates its energy at the beginning:

Σ_{n=0}^{m} h(n)² ≤ Σ_{n=0}^{m} hmin(n)²,   m ∈ ℕ.   (2.30)

Figure 2.15: Energy of a non-minimum-phase system (dashed, blue) and the corresponding minimum-phase system (red)

The energy of both sequences is the same, since the magnitude of their Fourier transforms is the same (by Parseval's theorem). This means that equality occurs in (2.30) when m → ∞.
Room impulse responses often have more energy in the reverberant component than in the component corresponding to the direct path (see figure 1.3). This implies that room transfer functions are often non-minimum-phase. A causal and stable inverse of a room impulse response is therefore, in general, impossible to find. The non-causality problem can be solved by introducing a delay, i.e. a delayed inverse filter is computed instead. However, the delay generally has to be quite long, which is not satisfactory for real-time applications.
2.3.3
As room transfer functions are most of the time non-minimum-phase, a perfect dereverberation cannot be achieved with a single microphone. It is possible to find a delayed inverse filter, but this solution is not really adequate for real-time processing.
However, it is possible to find the exact inverse at a point in the room by using several microphones. If the M room transfer functions Hi(z) have no common zeros, there exist FIR filters Gi(z) such that

Σ_{i=1}^{M} Gi(z) Hi(z) = 1,   (2.31)

where the orders of the Gi(z) are smaller than or equal to the highest order of the Hi(z). Figure 2.16 shows how equation (2.31) can be used to invert the M channels simultaneously. This method is called Multiple-input/output INverse Theorem (MINT).
Figure 2.16: Inversion of the M channels: the output of each channel Hi(z) is filtered by the corresponding Gi(z), and the filter outputs are summed to recover ŝ(n) ≈ s(n).
By using more than one microphone, the issue that room transfer functions are non-minimum-phase is bypassed. Moreover, the inverse filters are simple FIR filters, which can be computed by solving the linear system

d = [H1^T H2^T ... HM^T] g = H g,   (2.32)

where d = [1, 0, ..., 0]^T is a vector of length 2L − 1, g is the concatenation of the vectors gi = [gi(0), ..., gi(L − 1)]^T corresponding to the inverse filters,

g = [g1^T ... gM^T]^T,   (2.33)

and the Hi are the L × (2L − 1) Sylvester matrices corresponding to the polynomials Hi(z):

       [ hi(0)  ...  hi(L−1)     0      ...      0     ]
Hi =   [   0    hi(0)   ...   hi(L−1)   ...      0     ]   (2.34)
       [   :        ·.       ·.       ·.         :     ]
       [   0     ...     0    hi(0)    ...    hi(L−1)  ]

A Sylvester matrix makes it possible to compute a convolution (or a polynomial multiplication) as a matrix multiplication. Given two signals x(n) and y(n), of lengths Lx and Ly respectively, their convolution z(n) has Lx + Ly − 1 samples and can be written in vector form as

z = X^T y = Y^T x,   (2.35)

where X and Y are the corresponding Sylvester matrices.
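The linear system (2.32) can be solved numerically. A minimal sketch with two random channels of length L = 8 (random channels almost surely share no common zeros, which is the condition for MINT to work; the values are purely illustrative):

```python
import numpy as np

def conv_matrix(h, n_g):
    """Matrix C such that C @ g = np.convolve(h, g) for len(g) = n_g
    (the transpose of the Sylvester matrix of equation (2.34))."""
    L = len(h)
    C = np.zeros((L + n_g - 1, n_g))
    for j in range(n_g):
        C[j:j + L, j] = h
    return C

rng = np.random.default_rng(2)
L = 8
h1, h2 = rng.standard_normal(L), rng.standard_normal(L)

# Stack the two channels: d = [C1 C2] [g1; g2], with d a unit impulse
H = np.hstack([conv_matrix(h1, L), conv_matrix(h2, L)])  # (2L-1) x 2L
d = np.zeros(2 * L - 1)
d[0] = 1.0
g, *_ = np.linalg.lstsq(H, d, rcond=None)
g1, g2 = g[:L], g[L:]

# Verify: g1*h1 + g2*h2 is (approximately) a unit impulse
total = np.convolve(h1, g1) + np.convolve(h2, g2)
```

With two channels the system has one more unknown than equations, so an exact FIR solution exists even though neither channel is minimum-phase; a single-channel version of the same system would only admit a least-squares approximation.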
Chapter 3
Enhancement of a speech signal
Reverberation produces a distortion that alters the intelligibility of speech. A possible approach to the dereverberation problem is to consider the general properties of a speech signal that are degraded by the reverberation.
A simple way to improve the reverberant signal is, for example, to detect reverberation tails between words. By removing, or attenuating, these parts, which only contain reverberation, the listening comfort is slightly improved. However, this method, which is used in hearing aids, does not remove the distortion that alters the words themselves.
The two methods presented in this chapter use, more or less explicitly, the harmonicity property of the voiced segments of a speech signal in order to recover the clean signal. In section 3.1 the approach of Nakatani [4], using an adaptive harmonic filter, will be described. In section 3.2 an adaptive algorithm working on the LP residual of the speech signal will be presented.
3.1
Nakatani et al. propose in [4] an interesting single-microphone dereverberation method called Harmonicity-based dEReverBeration (HERB). This method is based on the harmonicity model of speech described in section 2.1.
The principle is to estimate a dereverberation operator using the harmonic parts of the speech signal. This operator, initially designed for the harmonic parts, is expected to work on the non-harmonic parts as well.
The performance of this method, presented in [4] and [10], is impressive. In this section we will begin by describing the principle of this dereverberation process. Then we will discuss its applicability to ASIMO.
3.1.1
In order to understand the basic idea of the HERB method, it is useful to observe the effect of reverberation on a sweeping sinusoid. In discrete time the sinusoidal sweep is defined by

s(n) = A sin( 2π ( (k/2) (n/fs)² + fstart (n/fs) ) ),   (3.1)

where A is the amplitude, fstart is the frequency at t = 0, fs is the sampling frequency, and k is a constant. Its instantaneous frequency varies linearly with time:

f(n) = k (n/fs) + fstart.   (3.2)
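The sweep of equation (3.1) is easy to generate. A small sketch, matching the half-second 100 Hz to 4000 Hz sweep of figure 3.1 (the 16 kHz sampling rate is an assumption); the constant k is chosen so that the instantaneous frequency (3.2) reaches 4000 Hz at the end:

```python
import numpy as np

fs = 16000                    # sampling frequency (Hz), an assumption
dur = 0.5                     # half-second sweep, as in figure 3.1
f_start, f_end = 100.0, 4000.0
k = (f_end - f_start) / dur   # sweep rate (Hz/s) so that f(dur) = f_end

n = np.arange(int(dur * fs))
t = n / fs
s = np.sin(2 * np.pi * ((k / 2) * t**2 + f_start * t))  # equation (3.1)
```

Note that the phase term (k/2) t² is integrated, so its derivative gives the linearly increasing instantaneous frequency k t + f_start of equation (3.2).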
Figure 3.1 (left) shows the spectrogram of a half-second-long discrete signal whose frequency sweeps from 100 to 4000 Hz. This spectrogram is obtained using a Gammatone filterbank; therefore the frequency scale is not linear (see (1.2)). This signal is then convolved with the impulse response shown in figure 2.7. The resulting spectrogram is shown in figure 3.1 (right). We can observe from this spectrogram that the sinusoidal component corresponding to the original signal can be clearly identified. In each frequency band, the energy corresponding to this component dominates.
The amplitude Â(l) and phase φ̂(l) of this dominant sinusoid are extracted in each frame and used to synthesize the signal

ŝ(n) = Σ_l g(n − nl) Â(l) cos( ω̂(l) n/fs + φ̂(l) ),   (3.3)

where g(n − nl) is a window function for overlap-add synthesis, ω̂(l) is the dominant frequency, and nl is the time index of the center of frame l.
3.1.2
Although a sweep signal contains only one dominant sinusoid, a harmonic signal contains several sinusoidal components whose frequencies correspond to its fundamental frequency F0 and its multiples (cf. section 2.1). The aim of a harmonic filter is to enhance these components. Since the fundamental frequency of a speech signal changes over time, the properties of the filter have to be adaptively modified according to F0 (see figure 3.2).

Figure 3.2: Adaptive harmonic filtering: the fundamental frequency F0 is estimated from the observed signal x(n) and controls the harmonic filter that produces x̂(n).

A simple approach to harmonic filtering is the comb filter, defined as 1 + z^{−τ}, where τ is the period to be enhanced. The method proposed by Nakatani in [4] is instead to filter the signal by synthesizing a harmonic sound as follows:
1. The fundamental frequency of the observed signal is estimated at each time frame. If the time frame is short enough, this fundamental frequency can be considered constant.
2. For each time frame, the amplitudes and phases of the harmonic components are
estimated from the Discrete Fourier Transform of the windowed frame and used to
synthesize a harmonic sound x̂_l(n),          (3.5), (3.6)
where l is the index of the time frame, n_l is the time index corresponding
to the center of the frame, m is the index of the frequency bin, M is the
number of points used for the Discrete Fourier Transform (DFT), A_{k,l} and
θ_{k,l} are respectively the estimated amplitude and phase of the k-th harmonic
component, F_{0,l} is the fundamental frequency of the time frame, g_1(n) is
an analysis window function and [·] discretizes a continuous frequency into
the index of the nearest frequency bin.
3. The output of the filter, x̂(n), is obtained by adding the synthesized frame
signals with an overlap-add synthesis window g_2:

    x̂(n) = Σ_l g_2(n − n_l) x̂_l(n),   with n_l = lT,          (3.7), (3.8)

where T is the frame shift.
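The comb filter 1 + z^(−τ) mentioned above can be illustrated numerically: its gain is 2 at F0 and its multiples and 0 halfway between them. A short Python check (F0 = 200 Hz and fs = 8000 Hz are illustrative values):

```python
import numpy as np

fs, F0 = 8000, 200.0
tau = int(round(fs / F0))            # period to be enhanced, in samples (here 40)

# Frequency response of H(z) = 1 + z^{-tau} on a 1 Hz grid
f = np.arange(0, fs // 2 + 1)        # 0 ... 4000 Hz
w = 2 * np.pi * f / fs
H = np.abs(1 + np.exp(-1j * w * tau))

print(H[200], H[400], H[300])        # ≈ 2.0, 2.0, 0.0
```

The gain of 2 at every multiple of F0 and the null between harmonics is exactly the enhancement/suppression pattern a harmonic filter needs, but with a fixed τ — hence the need for the adaptive scheme above when F0 changes over time.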
3.1.3
Dereverberation operator
Harmonic case
The dereverberation operation is computed in the frequency domain using the
short time Fourier transform. Let X(l, m) be the STFT of a reverberant signal.
X(l, m) can be represented as the product of the source signal, S(l, m), and the
room transfer function, H(m), which is assumed to be time invariant (cf. section
2.2). This transfer function can be divided into two functions, D(m) and R(m).
The former corresponds to the direct signal, D(m)S(l, m), and the latter to the
reverberant part, R(m)S(l, m):
X(l, m) = H(m)S(l, m)
= D(m)S(l, m) + R(m)S(l, m)
(3.9)
The aim of the dereverberation operator is to estimate the direct signal X′(l, m) =
D(m)S(l, m).
It can be obtained by subtracting the reverberant part R(m)S(l, m) from equation (3.9), or by finding the inverse filter W(m) such that

    W(m) = D(m) / H(m).          (3.10)
Then

    X′(l, m) = W(m) X(l, m) = (D(m)/H(m)) H(m) S(l, m) = D(m) S(l, m).          (3.11)
The basic idea of the HERB method is the following: if S(l, m) is a harmonic
signal, the direct signal contained in X(l, m) can be obtained using an adaptive
harmonic filter. At each time frame l an inverse filter W_0(l, m) is computed in
the frequency domain using the output X̂(l, m) of the harmonic filter:

    W_0(l, m) = X̂(l, m) / X(l, m).          (3.12)
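The averaging machinery behind the operator of eq. (3.12) can be sketched in Python (the thesis used Matlab). Here the harmonic-filter output X̂ is idealized as the exact direct signal — an assumption standing in for the adaptive harmonic filter — so the operator averaged over training utterances recovers D(m)/H(m) and, applied to an unseen reverberant signal, returns its direct part. All signals and the synthetic room response are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
Lh, Nfft = 64, 1024
h = rng.standard_normal(Lh) * np.exp(-0.1 * np.arange(Lh))   # synthetic room response
d = np.zeros(Lh); d[:4] = h[:4]        # "direct" part, defining D(m) as the first taps
H, D = np.fft.rfft(h, Nfft), np.fft.rfft(d, Nfft)

# Training: average the per-utterance inverse filters W0 = Xhat / X (cf. eq. (3.12))
W = 0
n_train = 20
for _ in range(n_train):
    s = rng.standard_normal(512)
    X = np.fft.rfft(np.convolve(s, h), Nfft)
    Xhat = np.fft.rfft(np.convolve(s, d), Nfft)   # idealized harmonic-filter output
    W = W + Xhat / X / n_train         # averaged operator, ideally D(m) / H(m)

# Applying W to an unseen reverberant signal recovers its direct part
s_test = rng.standard_normal(512)
enhanced = np.fft.irfft(W * np.fft.rfft(np.convolve(s_test, h), Nfft), Nfft)
direct = np.convolve(s_test, d)
err = np.linalg.norm(enhanced[:len(direct)] - direct) / np.linalg.norm(direct)
print(err)
```

With a real harmonic filter X̂ is only an approximation of the direct signal, which is exactly what the derivation of the general case below quantifies.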
General case
This process can be applied to a speech signal S(l, m) by rewriting equation
(2.1) in the frequency domain
S(l, m) = Sh (l, m) + Sn (l, m)
(3.14)
where Sh (l, m) is the harmonic part and Sn (l, m) is the non-harmonic part.
The observed reverberant signal X(l, m) is rewritten as
X(l, m) = D(m)Sh (l, m) + (R(m)Sh (l, m) + H(m)Sn (l, m))
(3.15)
where H(m) is the transfer function of the room, H(m) = D(m) + R(m).
The component D(m)Sh (l, m) can be approximately extracted from X(l, m) with
an adaptive harmonic filter. This approximated direct signal X̂(l, m) can be modeled as

    X̂ = D Sh + R̂h + Ĥn,          (3.16)

where R̂h and Ĥn denote the parts of R Sh and H Sn that pass the harmonic
filter. The dereverberation operator is again obtained as the time average of
W_0(l, m) = X̂(l, m)/X(l, m),

    W(m) = E_t { W_0(l, m) },          (3.17)

and W_0 can be expanded as

    W_0 = X̂ / X = (D Sh + R̂h + Ĥn) / (H S)          (3.18)

        = ( (D + R̂h/Sh) Sh ) / ( H (Sh + Sn) ) + ( (Ĥn/Sn) Sn ) / ( H (Sh + Sn) )          (3.19)

        = (D + R̂h/Sh)/H · 1/(1 + Sn/Sh) + (Ĥn/Sn)/H · 1/(1 + Sh/Sn).          (3.20)
It can be proven (see appendix A) that the expected value of a function 1/(1 + z),
where z is a complex random variable, is equal to the probability that |z| < 1, if
it is assumed that the phase of z is uniformly distributed, the phase of z and its
absolute value are statistically independent, and |z| ≠ 1.
Then, under the following conditions [10]:
1. The phase of Sn (l, m) and a joint event composed of Sh (l, m), R̂h (l, m) and
Sn (l, m) are independent,
2. The phase of Sh (l, m) and a joint event composed of Sh (l, m), Ĥn (l, m)
and Sn (l, m) are independent,
3. The phases of Sh (l, m) and Sn (l, m) are uniformly distributed within [0, 2π),
4. |Sh (l, m)| ≠ |Sn (l, m)|,
equation (3.17) can be derived as

    W(m) ≈ (D(m) + R̄(m)) / H(m) · P{ |Sh (l, m)| > |Sn (l, m)| },          (3.21)

where

    R̄(m) = E_t { R̂h (l, m) / Sh (l, m) }.          (3.22)
Given equation (3.21), W(m) is expected to remove reverberation not only from the
harmonic components of the speech signal but also from the non-harmonic ones. It
approximates the inverse filter D(m)/H(m), except for a remaining reverberation
due to R̄(m).
The enhanced signal is then expected to be the sum of the direct signal and
a reduced reverberation part. However, because of the term P{ |Sh (l, m)| >
|Sn (l, m)| }, the gain of the filter becomes zero in frequency regions where harmonic components were not included during the estimation of the dereverberation
operator.
In addition, even in frequency regions in which harmonic components were present
during the estimation phase, the filter gain is expected to decrease as the frequency increases. The reason is that in a speech signal the energy of higher-order
harmonic components is smaller than the energy of the sinusoidal components at
the fundamental frequency and its first multiples.
3.1.4
Figure 3.3 summarizes the dereverberation process using the HERB method.
[Block diagram: the source S(l, m) passes through the room transfer function H(m), giving X(l, m); X(l, m) is processed by the harmonic filter to obtain X̂(l, m), from which the operator W(m) is estimated and applied to X(l, m) to yield the enhanced signal Ŝ(l, m).]
Figure 3.3: Diagram of the HERB dereverberation method
3.1.5
The performance of this method reported in [4] and [10] is impressive. We have
implemented this method to test two points:
1. Can this method easily be adapted to replace the STFT by a Gammatone
filterbank?
2. Can the process work in real time and in real environments?
For HRI the interest of this method resides in the fact that the fundamental
frequency of the signal is already computed for other processes. A pitch-tracking
algorithm has already been developed at HRI [11] and can readily be used to
estimate the fundamental frequency of the signals. As the computation of the
fundamental frequency seems to be the critical point of the HERB algorithm, we
expected a lot from this method.
Harmonic Filter
The aim of this section is to compare the harmonic filter proposed by Nakatani in
[4] with an implementation of the harmonic filter on the Gammatone filterbank.
The implementation of the harmonic filter on the filterbank is simple. The outputs of the Gammatone filterbank are N signals corresponding to the different
frequency channels of the cochlea response. Knowing the fundamental frequency,
the frequency channels corresponding to F0 and its multiples are determined. At
a time sample t the signals of these channels are added.
The resulting spectrograms (see figure 3.4) show that both implementations of
the harmonic filter give similar results.
Figure 3.4: Top left: original signal (sweep with harmonics). Top right: reverberant
signal. Bottom left: harmonic estimate with the Gammatone filterbank. Bottom right: harmonic estimate with Nakatani's harmonic filter.
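The channel-selection idea can be sketched in Python (the thesis implementation was in Matlab). As a simplification, second-order Butterworth bandpass channels stand in for the Gammatone filters, the channel spacing and bandwidths are illustrative assumptions, and F0 is taken as known:

```python
import numpy as np
from scipy.signal import butter, lfilter

fs, F0, N = 8000, 200.0, 8000
t = np.arange(N) / fs
clean = sum(np.cos(2 * np.pi * k * F0 * t) for k in range(1, 6))   # harmonic signal
rng = np.random.default_rng(6)
x = clean + rng.standard_normal(N)                                 # noisy observation

centers = np.geomspace(100, 3000, 30)      # log-spaced channel center frequencies

def channel(sig, fc):
    # One stand-in filterbank channel (a real implementation uses gammatone filters)
    b, a = butter(2, [0.9 * fc / (fs / 2), 1.1 * fc / (fs / 2)], btype='band')
    return lfilter(b, a, sig)

# Keep only the channels whose center frequency is nearest to F0 and its multiples
picked = {int(np.argmin(np.abs(centers - k * F0))) for k in range(1, 6)}
y = sum(channel(x, centers[i]) for i in picked)

def harmonic_fraction(sig):
    # Share of the signal energy sitting exactly on the harmonic bins
    S = np.abs(np.fft.rfft(sig)) ** 2
    bins = [int(round(k * F0 * N / fs)) for k in range(1, 6)]
    return sum(S[b] for b in bins) / S.sum()

print(harmonic_fraction(x), harmonic_fraction(y))
```

Summing only the channels that contain a harmonic discards the noise in all other bands, so the harmonic share of the output energy increases — the same effect the Gammatone implementation exploits.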
Dereverberation operator
In order to compute the dereverberation operator a training sequence is required.
During this adaptation phase the room impulse response must not change.
In the simulation the training sequence was composed of several sweeping sinusoids with their harmonics, similar to the one shown in figure 3.4. The operator is
computed using a short-time Fourier transform. The restriction on the time window is very strong: it has to be long enough to contain a whole word (sweep)
including the complete reverberation tail.
It is important to note at this point that the window of the short-time Fourier
transform for the harmonic filter and for the estimation of the dereverberation
operator cannot be the same. For the harmonic filtering the analysis window must
be as small as possible in order to respect the assumption that the fundamental
frequency of the signal is constant. On the other hand, for the estimation of the
filter W(m), the time window of the STFT must be of several seconds.
It is also assumed that, during the adaptation phase of the dereverberation filter, the pause between the words is long enough. Otherwise the reverberation tail
of a word would alter the following word. Respecting these conditions the
dereverberation operator can be computed.
In order to estimate the performance of the algorithm, 500 random harmonic
signals (sweeps with harmonics) of 0.5 s each are used as training data. These
signals are convolved with the room impulse response shown in figure 2.7. As the
exact fundamental frequencies of these signals are known, a good estimation of
the dereverberation operator can be expected. This dereverberation operator is
then used to enhance a real speech signal convolved with the same room impulse
response (see figure 3.5).
Figure 3.6 shows the spectrogram of the enhanced signal. It is important
to note that the speech signal used for the test contains only one word. As the
time window of the STFT used to compute the dereverberation operator is long
enough to contain a whole word, the enhanced signal is obtained by multiplying
the FFT of the whole reverberant signal with the dereverberation operator.
The dereverberation works relatively well. However, we can see on the spectrogram that the dereverberation filter is not causal. This is not surprising, as we
explained in section 2.3 that room impulse responses are in general non-minimum-phase. Because of the non-causality the beginning of the signal is altered.
Figure 3.5: Spectrogram of the clean and reverberant signal used to test the reverberation operator.
Figure 3.6: Spectrogram of the enhanced signal computed in the frequency domain.
Figure 3.7: Impulse response of the dereverberation operator and spectrogram of the
enhanced signal computed in the time domain.
3.1.6
3.2
The dereverberation using the harmonicity of the signal requires too much training
data. Therefore, in this section, another dereverberation method will be discussed.
This method uses the autoregressive (AR) model of speech signals. Several methods have been proposed using linear prediction (LP) analysis [13], [14], [15].
3.2.1
Problem formulation
In section 2.1.4 it was explained that a speech signal s(n) can be expressed as
a linear combination of its L past sample values. The clean and the reverberant
speech become, respectively,

    s(n) = Σ_{k=1}^{L} a_k s(n − k) + e_s(n),          (3.24)

    x(n) = Σ_{k=1}^{L} b_k x(n − k) + e_x(n),          (3.25)

where a_k and b_k are the LP coefficients and e_s(n) and e_x(n) the LP residual
signals (or prediction errors).
The important assumption on which dereverberation methods using LP analysis
are based is that the LP coefficients are unaffected by the reverberation:

    b_k = a_k   ∀ k ∈ [1, L] ∩ ℕ.          (3.26)
Actually this assumption holds only in a spatially averaged sense [16], i.e. using
several microphones:

    E{b_k} = a_k   ∀ k ∈ [1, L] ∩ ℕ.          (3.27)

The enhanced speech signal ŝ(n) is then synthesized as

    ŝ(n) = Σ_{k=1}^{L} b_k ŝ(n − k) + ê(n),          (3.28)

i.e. the estimated LP coefficients obtained by linear prediction analysis are used
to synthesize a signal out of the enhanced excitation signal ê(n).
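The LP analysis behind eqs. (3.24)–(3.28) can be sketched in Python with NumPy/SciPy (the thesis used Matlab). A synthetic AR signal with known coefficients and a sparse, pulse-like excitation — a stand-in for the glottal pulses — is analyzed and its residual recovered; all parameter values are illustrative:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

rng = np.random.default_rng(4)
N, L = 5000, 2
a_true = np.array([1.3, -0.6])          # s(n) = 1.3 s(n-1) - 0.6 s(n-2) + e(n)
e = rng.standard_normal(N) * (rng.random(N) < 0.1)   # sparse, pulse-like excitation
s = np.zeros(N)
for n in range(2, N):
    s[n] = a_true[0] * s[n - 1] + a_true[1] * s[n - 2] + e[n]

# LP analysis (autocorrelation method): solve the Yule-Walker equations R a = r
r = np.correlate(s, s, 'full')[N - 1:N + L]
a = solve_toeplitz(r[:L], r[1:L + 1])   # estimated LP coefficients a_k

# LP residual e_s(n) = s(n) - sum_k a_k s(n - k), eq. (3.24) rearranged
res = s.copy()
for k in range(1, L + 1):
    res[k:] -= a[k - 1] * s[:-k]
print(a)
```

For a long enough signal the estimated coefficients come close to the true ones and the residual closely matches the sparse excitation — the peaky structure whose smearing by reverberation the next subsection measures with the kurtosis.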
3.2.2
Figure 3.9: Example of platykurtic (left) and leptokurtic (right) distributions. Both
distributions have the same standard deviation
Gillespie shows in [14] that the kurtosis of the LP residual is a valid reverberation
metric. The kurtosis β₂ of a random signal x(n) is the degree of peakedness of the
distribution, defined as the fourth central moment μ₄ normalized by the fourth
power of the standard deviation (or the square of the variance):

    β₂ = μ₄ / σ⁴ = E{ (x(n) − μ)⁴ } / ( E{ (x(n) − μ)² } )²,          (3.29)

where μ = E{x(n)} is the mean value of x(n).
As the kurtosis of a normal distribution is equal to 3, a kurtosis excess, denoted
γ₂ and defined by

    γ₂ = μ₄ / σ⁴ − 3,          (3.30)

is often used. A distribution with a high peak (γ₂ > 0) is called leptokurtic, a flat-topped curve (γ₂ < 0) is called platykurtic, and the normal distribution is called
mesokurtic.
Figure 3.9 illustrates the kurtosis measure. The distribution on the right is more
peaked at the center, so we would tend to conclude that it has a lower standard
deviation. But, on the other hand, it has thicker tails, which usually indicates a
higher standard deviation. If the effect of the peakedness exactly offsets that of the
thick tails, the two distributions have the same standard deviation.
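Eqs. (3.29)–(3.30) are easily checked numerically; a uniform distribution is platykurtic (γ₂ = −1.2), a Laplace distribution leptokurtic (γ₂ = +3), and a Gaussian mesokurtic (γ₂ = 0). A short Python check:

```python
import numpy as np

rng = np.random.default_rng(5)

def excess_kurtosis(x):
    # gamma2 = mu4 / sigma^4 - 3  (eqs. (3.29)-(3.30))
    c = x - x.mean()
    return np.mean(c ** 4) / np.mean(c ** 2) ** 2 - 3

n = 500000
flat = rng.uniform(-1, 1, n)        # platykurtic: gamma2 = -1.2
peaky = rng.laplace(0, 1, n)        # leptokurtic: gamma2 = +3
normal = rng.standard_normal(n)     # mesokurtic:  gamma2 = 0
print(excess_kurtosis(flat), excess_kurtosis(peaky), excess_kurtosis(normal))
```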
For clean voiced speech, the LP residuals have strong peaks corresponding to
glottal pulses (see figure 3.10), whereas for reverberated speech such peaks are
spread in time. In figure 3.11, the probability density functions of a clean signal
and of the convolution of this signal with the room impulse response computed
in the CARL Group's office (see figure 2.7) are estimated. Both signals have
been centered and normalized such that their means equal 0 and their standard
deviations equal 1.
Figure 3.10: On the left, extract of the LP residuals of a speech signal. Note the strong
peaks corresponding to the glottal pulses. On the right, the same signal impaired by
reverberation.
3.2.3
In the timedomain
In order to enhance the reverberant signal x(n) an adaptive filter can be built
which maximizes the kurtosis of its LP residual x̃(n). Given an L-tap adaptive
filter h(n) at time n, the output of this filter is ỹ(n) = hᵀ(n) x̃(n), where x̃(n) =
[x̃(n − L + 1), …, x̃(n − 1), x̃(n)]ᵀ. An LP synthesis filter then yields y(n), the final
processed signal. The adaptation of h(n) is similar to a traditional Least-Mean-Square
(LMS) adaptive filter [17], except that the optimized value is a feedback function
f(n), corresponding to the gradient of the kurtosis.
Figure 3.12 (a) shows a diagram of the maximization system. The problem of
this algorithm is the LP reconstruction artifacts. However, this system is linear
and the order of the filters can be arbitrarily changed; h(n) can therefore be computed
from x̃(n), but applied directly to x(n) (see figure 3.12 (b)).
A gradient method can be used to optimize the kurtosis. The gradient of the
kurtosis is computed by

    ∂β₂/∂h = 4 ( E{ỹ²} E{ỹ³ x̃} − E{ỹ⁴} E{ỹ x̃} ) / E³{ỹ²}          (3.31)

           ≈ ( 4 ( E{ỹ²} ỹ² − E{ỹ⁴} ) ỹ / E³{ỹ²} ) x̃ = f(n) x̃(n),          (3.32)

where f(n) is the feedback function used to control the filter updates. For continuous adaptation, the expected values E{ỹ²} and E{ỹ⁴} are estimated recursively
by

    E_{ỹ²}(n) = β E_{ỹ²}(n − 1) + (1 − β) ỹ²(n),          (3.33)

    E_{ỹ⁴}(n) = β E_{ỹ⁴}(n − 1) + (1 − β) ỹ⁴(n).          (3.34)
Figure 3.12: (a) A single-channel time-domain adaptive algorithm for maximizing the
kurtosis of the LP residuals. (b) Equivalent system, which avoids LP reconstruction
artifacts.
The filter is then updated according to

    h(n + 1) = h(n) + μ f(n) x̃(n),          (3.35)

where μ is a small positive adaptation step.
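The adaptation loop of eqs. (3.31)–(3.34) can be sketched in Python (the thesis used Matlab). A sparse excitation stands in for the clean LP residual, the "reverberation" is a synthetic decaying filter, and a unit-norm constraint on the filter is added here as a stabilizing simplification; whether the kurtosis actually increases depends on the step size and the signal, which is exactly the robustness issue discussed in the next section:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 20000, 8
# A sparse, peaky excitation stands in for the LP residual of clean speech
e = rng.standard_normal(N) * (rng.random(N) < 0.05)
room = rng.standard_normal(30) * np.exp(-0.2 * np.arange(30))
x = np.convolve(e, room)[:N]              # "reverberated" residual: peaks smeared in time

h = np.zeros(L); h[0] = 1.0               # adaptive filter h(n)
mu, beta = 1e-4, 0.99
Ey2, Ey4 = 1.0, 3.0                       # recursive moment estimates, eqs. (3.33)-(3.34)

for n in range(L - 1, N):
    xv = x[n - L + 1:n + 1][::-1]
    y = h @ xv
    Ey2 = beta * Ey2 + (1 - beta) * y ** 2
    Ey4 = beta * Ey4 + (1 - beta) * y ** 4
    f = 4 * (Ey2 * y ** 2 - Ey4) * y / (Ey2 ** 3 + 1e-12)   # feedback, eq. (3.32)
    h = h + mu * f * xv                   # gradient ascent on the kurtosis
    h = h / np.linalg.norm(h)             # unit-norm constraint (a simplification)

def kurt(v):
    return np.mean(v ** 4) / np.mean(v ** 2) ** 2

y_out = np.convolve(x, h)[:N]
print(kurt(x), kurt(y_out))
```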
3.3
The maximization of the kurtosis permits real-time dereverberation. The adaptation is quick if a short adaptive filter is used. However, in the case of strong
reverberation the improvement on the signal is not perceptible.
If the length of the adaptive filter is increased, the kurtosis is still maximized and
the algorithm converges to a signal with maximum kurtosis. But the resulting
signal sometimes has a higher kurtosis than the original clean signal. The sound is
strongly distorted and sometimes no longer intelligible. Figure 3.14 shows
the original LP residual of the clean signal. This signal is artificially reverberated
and then enhanced by maximizing the kurtosis of the LP residuals. The resulting
LP residual has a higher kurtosis than the original one. This means that the
maximization has to be constrained. The clean speech has a higher kurtosis than
the reverberated one, but this does not mean that the signal with the highest
kurtosis is the clean signal.
Actually the length of the adaptive filter must not be longer than the period
of the glottal pulses. With this constraint, the efficiency of the dereverberation is
limited.
Another drawback of this method is that the LP analysis, as explained in section
2.1.4, is a very good approximation of the magnitude spectrum of the speech
signal but strongly alters the phase spectrum. As the phase is crucial for source
localization, it should be studied whether this method dramatically alters the
phase information of the signal.
Figure 3.14: On the left the LP residual of a clean signal. On the right the LP residual
of the resulting dereverberated signal. The kurtosis of the dereverberated signal is
higher than the kurtosis of the original signal. The resulting signal is strongly distorted.
Chapter 4
Equalization of room impulse
responses
In chapter 3 the dereverberation approach considered the effect of the room as
a distortion which alters the harmonicity of the speech signal. This chapter will
discuss methods to estimate room impulse responses. These estimated impulse
responses can then be equalized (inverted) in order to recover the original clean
speech signal (see section 2.3).
In section 4.1 the principle of a channel estimation method using the second
order statistics of the observed signals will be explained. Then, in sections 4.2
and 4.3, two different implementations of this principle will be discussed. Finally,
in section 4.4, some ideas for improvement will be proposed.
4.1
Some methods have been proposed to estimate a single channel. For example, Hopgood proposes in [19] a single-channel estimation method based on the non-stationarity of speech and the stationarity of the room impulse response. However, in most of the cases, these methods require that the input signal is white
noise, which is not the case for a speech signal. On the contrary, the estimation of
several room impulse responses simultaneously is possible [20]. Moreover, as it
was explained in section 2.3, it is much easier to find a global inverse for two or
more room impulse responses than the inverse of a single one. In this section a
method will be presented which permits estimating the impulse responses of a
SIMO system.
4.1.1
Hypothesis
In [20] Tong et al. show that a Single-Input Multiple-Output (SIMO) system
can be identified under the following conditions.
1. The autocorrelation matrix of the source signal is of full rank.
2. The channel transfer functions do not share any common zeros.
4.1.2
Basic idea
The relation between the input and the outputs of a SIMO system (see figure
4.1) is:

    x_i(n) = h_i(n) ∗ s(n),   i ∈ [1, M],          (4.1)

[Figure 4.1: block diagram of the SIMO system: the source s(n) feeds the M channels h_1(n), h_2(n), …, h_M(n), whose outputs are x_1(n), x_2(n), …, x_M(n).]

In matrix form,

    x_i(n) = H_i s(n),          (4.2)

where

    x_i(n) = [ x_i(n), x_i(n − 1), …, x_i(n − L + 1) ]ᵀ,          (4.3)

    H_i = [ h_i(0) ⋯ h_i(L − 1)        0
              ⋱              ⋱
            0        h_i(0) ⋯ h_i(L − 1) ]   (an L × (2L − 1) Sylvester matrix),          (4.4)

    s(n) = [ s(n), s(n − 1), …, s(n − 2L + 2) ]ᵀ.          (4.5)
A cross-correlation-like matrix is then built:

    Rx = [  Σ_{i≠1} R_{x_i x_i}    −R_{x_2 x_1}         ⋯    −R_{x_M x_1}
            −R_{x_1 x_2}           Σ_{i≠2} R_{x_i x_i}  ⋯    −R_{x_M x_2}
            ⋮                      ⋮                    ⋱    ⋮
            −R_{x_1 x_M}           −R_{x_2 x_M}         ⋯    Σ_{i≠M} R_{x_i x_i} ],          (4.6)

where R_{x_i x_j} = E{ x_i(n) x_jᵀ(n) } are the auto- and cross-correlation matrices of
the observed signals. The matrices R_{x_i x_j} can be written as

    R_{x_i x_j} = (1/T) X_i X_jᵀ.          (4.7)
Consider the first block row of the product Rx h:

    (1/T) ( Σ_{i≠1} X_i X_iᵀ h_1 − X_2 X_1ᵀ h_2 − ⋯ − X_M X_1ᵀ h_M )
        = (1/T) Σ_{i≠1} X_i ( X_iᵀ h_1 − X_1ᵀ h_i ).

A left multiplication by the transpose of a Sylvester matrix is a convolution, so
the term X_iᵀ h_1 − X_1ᵀ h_i actually equals

    x_i(n) ∗ h_1(n) − x_1(n) ∗ h_i(n) = s(n) ∗ h_i(n) ∗ h_1(n) − s(n) ∗ h_1(n) ∗ h_i(n),          (4.8)

and, as the convolution of real signals is commutative, this term equals zero. The
same development can be performed for the other rows of the matrix product
Rx h, which gives:
Rx h = 0,
(4.9)
which means that the vector h lies in the null space of the matrix Rx .
4.1.3
There are then two distinct approaches to identify the SIMO system: a batch
method, discussed in section 4.2, and an iterative method, discussed in section 4.3.
4.1.4
The second hypothesis, which requires that the channel transfer functions do not
share any common zeros, can be explained as follows:
Consider for example two channels with impulse responses h_1(n) and h_2(n). If the
transfer functions of these channels share common zeros then the impulse responses
can be rewritten as

    h_1(n) = d(n) ∗ h̃_1(n),          (4.10)

    h_2(n) = d(n) ∗ h̃_2(n),          (4.11)

where d(n) is, by analogy with polynomials, the greatest common divisor of h_1(n)
and h_2(n), and the transfer functions of h̃_1(n) and h̃_2(n) are coprime (do not
share any common zeros). Then x_1(n) and x_2(n) become

    x_1(n) = s(n) ∗ d(n) ∗ h̃_1(n),          (4.12)

    x_2(n) = s(n) ∗ d(n) ∗ h̃_2(n),          (4.13)

and, if the correlation matrix of s(n) ∗ d(n) is full rank, the methods will identify
the system [h̃_1(n) h̃_2(n)] instead of [h_1(n) h_2(n)].
4.1.5
Both the batch and iterative implementations require that the lengths of the
channels are given. The estimation of these lengths is very important, as we will
explain in this subsection.
In the two-microphone case, the channel estimation tries to find two FIR filters
g_1(n) and g_2(n) of length L_g + 1 such that

    h_1(n) ∗ g_2(n) − h_2(n) ∗ g_1(n) = 0,          (4.14)

where h_1(n) and h_2(n) are the two unknown FIR filters we want to identify. The
lengths of these filters are equal to L_h + 1, which is also unknown.
In the z-domain, the relation between the filters can be written as an operation
on polynomials:

    H_1(z) G_2(z) = H_2(z) G_1(z).          (4.15)

The polynomials H_1(z)G_2(z) and H_2(z)G_1(z) are equal if and only if they have
exactly the same L_h + L_g zeros.
As H_1(z) and H_2(z) do not share common zeros, each zero of H_1(z), resp. H_2(z),
must also be a zero of G_1(z), resp. G_2(z). G_1(z) and G_2(z) therefore contain at
least L_h zeros, and thus L_g ≥ L_h.
When L_g = L_h, the method directly returns the estimated channels. However,
when the lengths of the filters (the channel order) are overestimated, additional
zeros appear. Figure 4.2 illustrates the system in the two-microphone case.
[Figure 4.2: block diagram of the cross-relation system in the two-microphone case: the source s(n) passes through H_1(z) and H_2(z); x_1(n) is then filtered by H_2(z) and x_2(n) by H_1(z), and the difference of the two outputs is zero for filters in the null space.]
If we look at the estimated zeros of the channels in the z-plane, we can observe
these additional zeros on each channel (see figure 4.3 (left)). However, these additional zeros are common to all the estimated channels (see figure 4.3 (right)).
The channel identification works if the channel orders are overestimated, and
the additional zeros in the transfer functions cause a distortion which can be
removed. On the contrary, if the channel orders are underestimated, the method will
not manage to estimate the channels properly. The relation between two channels
(equation (4.14)) cannot be satisfied. The positions of the zeros of the estimated
channels largely differ from the positions of the zeros of the real channels. The
consequences on the estimated impulse responses, and especially on their inverses,
are disastrous.
Figure 4.3: Estimated zeros and real zeros for one channel (left); zeros of all the
estimated channels (right). On the left, 4 estimated zeros are isolated: they do not correspond
to a real zero of the filter. On the right it can be noticed that these 4 additional zeros
are common to all the estimated channels.
4.2
Batch method
The batch method for SIMO system identification is well described in [21]. The
principle is to compute the eigenvalues of the cross-correlation-like matrix Rx. In
the noiseless case there are L̂ − L + 1 eigenvalues equal to zero, where L is the
real order of the channels and L̂ the estimated one.
In figure 4.4 a two-channel system is simulated. The speech source signal is
sampled at 16 kHz and the order of the two impulse responses is 100 taps. These
impulse responses are artificially generated by the multiplication of a white Gaussian noise with an exponential decay. The correlation matrix is computed for an
estimated filter order of 110. There are 11 eigenvalues equal to 0.
4.2.1
As it was explained in the previous section, the real impulse response of the room
is the common part of the eigenvectors which span the null space. A method to
compute this greatest common divisor is to write a matrix

    K = [ g_1  g_2  …  g_{L̂−L+1} ],          (4.16)
Figure 4.4: Eigenvalues of the matrix Rx in the noiseless case. On the right: zoom on
the smallest eigenvalues.
theorem can be found in [21]. By performing row rotations on the null-space
matrix, several estimations of this common part can be obtained and averaged.
The QR decomposition of a matrix A is a factorization expressing A as

    A = QR,          (4.17)

where Q is an orthogonal matrix (QQᵀ = I) and R is an upper triangular matrix. The matrices Q and R can be computed using the Gram-Schmidt method.
The Gram-Schmidt process of linear algebra is a method of orthogonalizing a set
of vectors in an inner product space. Orthogonalization in this context means the
following: we start with vectors v_1, …, v_k (the column vectors of A) which are
linearly independent and we want to find mutually orthogonal vectors u_1, …, u_k
which generate the same subspace as the vectors v_1, …, v_k. The matrix Q represents the orthogonal basis vectors and the matrix R the coordinates of the vectors
v_1, …, v_k in this new basis.
Figure 4.5 shows, on the left, 4 of the 11 eigenvectors corresponding to the eigenvalue 0 for the system presented in figure 4.4. On the right we can notice that
the normalized common part of the null space and the normalized original impulse responses are equal. The estimated impulse responses begin with 10 zeros
corresponding to the channel-order overestimation.
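The core of the batch method can be sketched in Python with NumPy (the thesis used Matlab). For simplicity the channel order is taken as known exactly, so the null space of the matrix of eq. (4.6) is one-dimensional and its eigenvector directly gives the concatenated channels; the white source and random channels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 8, 4000                      # channel length (known exactly here), source length
s = rng.standard_normal(N)          # white source: full-rank autocorrelation (hypothesis 1)
h1, h2 = rng.standard_normal(L), rng.standard_normal(L)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)   # the two microphone signals

def corr(a, b, L):
    # R_ab = E{ a_vec(n) b_vec(n)^T } with a_vec(n) = [a(n), ..., a(n-L+1)]^T
    A = np.stack([a[n - L + 1:n + 1][::-1] for n in range(L - 1, len(a))])
    B = np.stack([b[n - L + 1:n + 1][::-1] for n in range(L - 1, len(b))])
    return A.T @ B / len(A)

R11, R22 = corr(x1, x1, L), corr(x2, x2, L)
R12, R21 = corr(x1, x2, L), corr(x2, x1, L)

# Cross-correlation-like matrix of eq. (4.6) for M = 2: [h1; h2] lies in its null space
Rx = np.block([[R22, -R21], [-R12, R11]])
w, V = np.linalg.eigh(Rx)
h_est = V[:, 0]                     # eigenvector of the smallest eigenvalue

h_true = np.concatenate([h1, h2])
similarity = abs(h_est @ h_true) / (np.linalg.norm(h_est) * np.linalg.norm(h_true))
print(similarity)
```

The estimate is recovered up to an arbitrary scale and sign, which is inherent to blind identification; with an overestimated order the null space becomes multi-dimensional and the common-part extraction described above is needed.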
Figure 4.5: Left: 4 of the 11 eigenvectors of the null space. Right: common part of
the null space (blue) and real impulse response (red). The impulse responses of the 2
channels are concatenated and 10 zeros (corresponding to the overestimation of the
order) were added.
4.2.2
Noisy case
When the observed signals are corrupted by an additive noise b_i(n),

    y_i(n) = x_i(n) + b_i(n),          (4.18)

the matrix Ry has no eigenvalues equal to zero. However, this matrix can be
approximated by

    Ry ≈ Rx + Rb,          (4.19)

and as Rb ≈ σ² I, where I is the identity matrix and σ² the variance of the noise,
the L̂ − L + 1 smallest eigenvalues will be σ² instead of zero. The corresponding
eigenvectors will remain intact.
Figure 4.6 shows the effect of an additive noise on the eigenvalues for the example system. With light noise the eigenvalues equal to σ² can clearly be
identified. However, as the autocorrelation matrix of the speech signal has very
small eigenvalues (which are not equal to 0), the smallest eigenvalues become hard
to identify if the signal-to-noise ratio decreases.
4.3
Iterative method
Figure 4.6: Eigenvalues of the correlation matrix in the noisy case. The variance of
the noise is equal to 10⁻¹⁰ on the left and 10⁻⁶ on the right.
For realistic channel orders, the memory requirements and the computational load
of the batch method are prohibitive. Huang and Benesty propose in [22] an
iterative method which solves the problem using adaptive filtering instead of an
eigenvalue computation.
The iterative method directly uses the cross relations among the observed signals,
following from the fact that

    x_i ∗ h_j = s ∗ h_i ∗ h_j = x_j ∗ h_i,   i, j = 1, 2, …, M.          (4.20)
An a priori error signal between channels i and j is defined from the estimated
filters ĥ_i(n),

    e_ij(n + 1) = x_iᵀ(n + 1) ĥ_j(n) − x_jᵀ(n + 1) ĥ_i(n),          (4.21), (4.22)

which, with the concatenated estimate

    ĥ(n) = [ ĥ_1ᵀ(n)  ĥ_2ᵀ(n)  …  ĥ_Mᵀ(n) ]ᵀ,          (4.23)

leads to a normalized error signal

    ε_ij(n + 1) = e_ij(n + 1) / ‖ĥ(n)‖.          (4.24)
The cost function is then

    J(n + 1) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} ε_ij²(n + 1),          (4.25)

and the update equation of the normalized multichannel LMS algorithm (NMCLMS) [10] is

    ĥ(n + 1) = ĥ(n) − μ ∇J(n + 1),          (4.26)

where μ is a small positive update step and ∇J(n + 1) the gradient of the cost
function.
Contrary to the batch method, which gives several estimations of the channels in
case of overestimation of the channel orders, the iterative method does not offer
an easy post-processing method to remove the additional zeros of the estimated
filters.
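A simplified variant of this iterative scheme can be sketched in Python (the thesis used Matlab): a sample-by-sample LMS-type minimization of the cross-relation error of eq. (4.20) for two channels, with a unit-norm constraint on the stacked filter standing in for the normalized-error formulation; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
L, N = 4, 30000
s = rng.standard_normal(N)                      # white source
h1, h2 = rng.standard_normal(L), rng.standard_normal(L)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2) # microphone signals

h = rng.standard_normal(2 * L)
h /= np.linalg.norm(h)                          # ||h|| = 1 excludes the trivial solution
mu = 1e-3
errs = []
for n in range(L - 1, len(x1)):
    v1 = x1[n - L + 1:n + 1][::-1]
    v2 = x2[n - L + 1:n + 1][::-1]
    e = v1 @ h[L:] - v2 @ h[:L]                 # cross-relation error, cf. eq. (4.20)
    errs.append(e * e)
    h[:L] += mu * e * v2                        # LMS-type updates decreasing e^2
    h[L:] -= mu * e * v1
    h /= np.linalg.norm(h)

h_true = np.concatenate([h1, h2])
h_true /= np.linalg.norm(h_true)
print(np.mean(errs[:1000]), np.mean(errs[-1000:]), abs(h @ h_true))
```

The squared cross-relation error shrinks as the stacked estimate drifts toward the null-space direction; with a speech source instead of white noise this convergence becomes much slower, which is the conditioning problem noted in the simulation below.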
4.3.1
Huang and Benesty [22] made two propositions to improve the optimization.
Firstly, the efficiency of the computation is improved by using a frequency-domain
approach: the convolutions are computed with a Fast Fourier Transform (FFT)
and an overlap-save method.
Secondly, a Newton method is used instead of a gradient descent to speed up the
convergence of the optimization, at the price of computing the Hessian matrix of
the cost function. To lower the computational burden, this matrix can be approximated by a diagonal matrix and computed recursively.
4.3.2
Simulation
This method is faster than the batch implementation and requires less memory.
If the order of the channels is small, the adaptive process can be computed in
real time. However, if only two microphones are used, the convergence takes too
long, as speech signals are generally not well conditioned.
Figure 4.7 (left) shows the iteratively estimated zeros of one of the channels of the
example system presented in figure 4.4, without overestimation of the channel
orders. It can be noticed that most of the zeros are close to
their real positions but some are very badly estimated. The estimated impulse
responses can be inverted using the MINT method (section 2.3). The sum of the
convolutions Σ_i h_{i,original}(n) ∗ g_i(n) can be computed, where the h_{i,original}(n) are
the real impulse responses of the systems (which are known in this simulation)
and the g_i are the inverses obtained out of the estimated impulse responses.
The result is shown in figure 4.7 (right): it should be a unit impulse (drawn in
red), but, as can be seen, there are large deviations from the unit impulse due
to the badly estimated zeros.
Figure 4.7: Iterative estimation of the channel impulse responses using two microphones. On the left the estimated zeros (blue) of one of the channels are compared
with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac impulse (red).
4.4
The main issue of this channel estimation method is that the channel order has
to be overestimated. When it is underestimated the real impulse response cannot
be found. Moreover, the impulse response of a room is generally very long (more
than 10000 samples for a sampling frequency of 16 kHz). In real environments,
the required memory and the computational load are so high that it is impossible
to estimate the channels.
Figure 4.8: Iterative estimation of the channel impulse responses using 5 microphones.
On the left the estimated zeros (blue) of one of the channels are compared with their
real values (red). On the right the remaining impulse response after inversion of the
system is drawn (blue); in the ideal case it should be a Dirac impulse (red).
An idea to reduce the length of the room impulse response is to perform a subband
processing, using a filterbank and applying a downsampling of the signal in each
subband.
Given f_j(n), the analysis filter of subband j, the signal of channel i in subband j is

    x_{i,j}(n) = x_i(n) ∗ f_j(n)
               = s(n) ∗ h_i(n) ∗ f_j(n)
               = (s(n) ∗ f_j(n)) ∗ h_i(n)          (4.27)
               = s_j(n) ∗ h_i(n).

In subband j, the channel identification problem can then be formulated the same
way as before by replacing s(n) with s_j(n) = s(n) ∗ f_j(n).
However, as the filter f_j(n) is a bandpass filter, the new source signal s_j(n) has a
limited bandwidth. Its autocorrelation matrix is then not full rank, and the first
estimation hypothesis is not fulfilled anymore.
This means that the downsampling of the signal has to be performed not only to
reduce the computational load but is also required to fulfill the estimation
hypothesis.
However, after the subsampling, the observed subband signals no longer seem
correlated and the channel identification does not work. In order to find an explanation,
we performed a small simulation. Two signals are randomly generated and filtered
with a simple FIR low-pass filter in order to be subsampled without aliasing. In
the first experiment the two signals are first convolved and then subsampled. In
the second experiment the same signals are subsampled and then the convolution
of the two resulting signals is computed. Figure 4.9 shows the position of the zeros
for the two obtained signals.
Figure 4.9: Comparison of the position of the zeros when the convolution and the
subsampling are performed in a different order.
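This experiment is easy to reproduce in Python (the thesis used Matlab). For band-limited signals the spectrum of "convolve then subsample" equals twice the spectrum of "subsample then convolve", so the zeros near the unit circle coincide; the small residual between the two sequences is nevertheless enough to move the zeros far from the unit circle. Filter length and cutoff are illustrative:

```python
import numpy as np
from scipy.signal import firwin

rng = np.random.default_rng(8)
lp = firwin(65, 0.35)                          # low-pass, cutoff well below half Nyquist
a = np.convolve(rng.standard_normal(16), lp)   # band-limited random signals
b = np.convolve(rng.standard_normal(16), lp)

y1 = np.convolve(a, b)[::2]                    # first convolve, then subsample
y2 = np.convolve(a[::2], b[::2])               # first subsample, then convolve

# Compare the magnitude spectra: they agree up to a factor of 2 for
# band-limited inputs, so the frequency responses (zeros near the unit
# circle) coincide even though the coefficient sequences differ slightly.
F1 = np.abs(np.fft.rfft(y1, 512))
F2 = np.abs(np.fft.rfft(y2, 512))
rel = np.linalg.norm(F1 - 2 * F2) / np.linalg.norm(F1)
print(rel)
```

Because the cross relations constrain the full transfer functions (all zeros), not just the frequency responses on the unit circle, this residual is what breaks the subband identification.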
We can notice that near the unit circle the zeros of the two signals coincide.
This was expected, as the Fourier transforms of the two signals are the same (up
to a constant factor). However, the positions of the zeros far from the unit circle
differ greatly.
As the channel identification method uses the transfer functions of the filters and
not only their frequency responses, the cross relations between the channels are
impaired by the subsampling and the channel identification does not work.
4.5
This method uses second-order statistics, which assures a robust convergence of
the estimation to the real room impulse response. Even if the batch method can
manage the overestimation of the channel order, the iterative implementation
should be chosen for computational and memory reasons.
All the recent publications on this subject try either to speed up the convergence
of the estimation [24] or to cope with the overestimation problem [25] of the
iterative implementation.
However, even if the algorithms converge optimally and the common part of the
estimated channels can be removed, it is utopian to think that such a method
can work in real environments. For computational reasons, the adaptation cannot be
performed in real time. And even if a supercomputer were available, the probability
that the channels share common zeros logically increases as the number of zeros
increases; moreover, the smallest eigenvalues of the autocorrelation matrix
of the original speech signal become smaller and smaller as the size of the matrix increases. Therefore the necessary conditions for the channel identification
certainly do not hold anymore when the channel order increases.
Moreover, even if a room impulse response has a finite duration, it is difficult to
estimate when the infinitely small coefficients can be neglected. In case of an
underestimation of the channel order, even if the truncated coefficients of the
filters are small, the effect on the channel estimation is significant, and after the
inversion of the system strong reverberation will still be present. Sometimes this
remaining reverberation is worse than the reverberation in the observed signal.
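A small numerical experiment (again a Python/NumPy toy, with exponentially decaying random filters standing in for room impulse responses) shows how an underestimated channel order destroys the cross-relation identification even though the truncated taps are small:

```python
import numpy as np

def conv_matrix(x, L):
    # Convolution matrix T such that T @ h == np.convolve(x, h) for len(h) == L.
    T = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        T[j:j + len(x), j] = x
    return T

def smallest_cr_eigenvalue(x1, x2, L):
    # Smallest eigenvalue of the cross-relation matrix for an assumed order L.
    A = np.hstack([conv_matrix(x2, L), -conv_matrix(x1, L)])
    return np.linalg.eigvalsh(A.T @ A)[0]

rng = np.random.default_rng(0)
L = 16
decay = 0.75 ** np.arange(L)            # exponentially decaying channel tails
h1 = rng.standard_normal(L) * decay
h2 = rng.standard_normal(L) * decay
s = rng.standard_normal(3000)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)

ev_exact = smallest_cr_eigenvalue(x1, x2, L)       # ~0: a null vector exists
ev_under = smallest_cr_eigenvalue(x1, x2, L - 4)   # far from 0: no null vector
print(ev_exact, ev_under)
```

With the correct order the cross-relation matrix has a null vector (the true channels); with the underestimated order no filter pair satisfies the cross-relation, so the "best" estimate is strongly biased despite the truncated taps being small.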
Finally, the subband processing, from which we expected a reduction of the
channel orders, did not manage to find a channel estimate. This method can
therefore only work with very short channels (about 300 samples) and is unusable
in real environments (channel orders of about 20000 samples).
Chapter 5
Conclusion and outlook
5.1
This diploma thesis selected and investigated the most promising blind derever-
beration methods existing today. We will now review the observations we made
about these methods after implementing them.
5.1.1
Harmonicitybased dereverberation
5.1.2
5.1.3
Channel estimation
5.1.4
Table 5.1 recapitulates some general information about the implemented methods.
For the kurtosis maximization the number of taps of the adaptive filter can be
increased; however, in this case the maximization is not robust.

Table 5.1:
  Kurtosis maximization:           yes                 | 500 taps | no  | yes | yes
  Channel estimation (batch):      for short channels  | 150 taps | yes | no  | no
  Channel estimation (iterative):  for short channels  | 300 taps | yes | no  | no
5.2
At first sight the methods based on the estimation and inversion of the room
impulse response seem more practical to use. Indeed, as the effect of the room
is more stationary than speech, its parameters are easier to estimate. However,
such a method has no control over the quality of the output signal, as it does not
take into consideration that the input and the output are speech signals.
On the other hand, the methods based on a model of speech will always improve
the quality of the speech. However, the optimization criteria are subjective,
i.e. the aim is that the speech sounds better. Moreover, the phase spectrum, which
is less important for speech perception, is often neglected in such methods.
5.3
Appendix
Appendix A
Proofs
On page 50 the following theorem has been used to derive the dereverberation
operator of the HERB method.
Given the complex function

    f(z) = \frac{1}{1+z},                                               (A.1)

where the phase of z is assumed to be uniformly distributed, it holds that
E{f(z)} = 1 for |z| < 1 and E{f(z)} = 0 for |z| > 1.

First case |z| < 1:
The function can be expanded as the geometric series

    f(z) = \sum_{k=0}^{+\infty} (-z)^k ;                                (A.2)

this series converges as |z| < 1.
The complex value z can be written

    z = r e^{j\theta},                                                  (A.3)

where r and \theta are respectively the absolute value and the phase of z. The
infinite series can then be rewritten as

    f(z) = \sum_{k=0}^{+\infty} r^k e^{jk(\theta + \pi)} ,              (A.4)

and its expectation over the phase is

    E{f(z)} = \sum_{k=0}^{+\infty} r^k E{e^{jk(\theta + \pi)}} .        (A.5)

As \theta is uniformly distributed,

    E{e^{jk(\theta + \pi)}} = \delta(k),   k \in \mathbb{N},            (A.6)

with

    \delta(k) = 1 if k = 0, 0 else,                                     (A.7)

and all the terms of the infinite series are equal to zero except for k = 0.
As the infinite series is reduced to one term, it converges and its value is 1
(E{r^0} = 1).

Second case |z| > 1:
In this case f(z) can be rewritten as

    f(z) = \frac{1}{1+z} = \frac{z^{-1}}{1 + z^{-1}}                    (A.8)
         = z^{-1} \sum_{k=0}^{+\infty} (-z^{-1})^k                      (A.9)
         = r^{-1} e^{-j\theta} \sum_{k=0}^{+\infty} r^{-k} e^{-jk\theta} e^{jk\pi}   (A.10)
         = \sum_{k=0}^{+\infty} r^{-(k+1)} e^{-j(k+1)\theta} e^{jk\pi}               (A.11)
         = \sum_{k'=1}^{+\infty} r^{-k'} e^{-jk'\theta} e^{j(k'-1)\pi} .             (A.12)

Every term of this series contains a factor e^{-jk'\theta} with k' >= 1, whose
expectation over the uniformly distributed phase is zero, so that

    E{f(z)} = 0 .                                                       (A.13)

Combining both cases yields

    E{f(z)} = 1 if |z| < 1, 0 if |z| > 1.                               (A.14)
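The theorem can also be checked numerically; the following sketch (Python/NumPy, averaging over a dense uniform grid of phases as a stand-in for the expectation) reproduces both cases:

```python
import numpy as np

# Average f(z) = 1/(1 + r e^{j*theta}) over a uniformly distributed phase.
theta = 2 * np.pi * (np.arange(100000) + 0.5) / 100000

def mean_f(r):
    return np.mean(1.0 / (1.0 + r * np.exp(1j * theta)))

print(mean_f(0.5))   # close to 1, as |z| < 1
print(mean_f(2.0))   # close to 0, as |z| > 1
```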
Bibliography
[1] Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic
Press, 2003.
[2] B.C.J. Moore and B.R. Glasberg. Suggested formulae for calculating
auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am.,
74:750-753, 1983.
[3] John R. Deller, Jr., John H.L. Hansen, and John G. Proakis. Discrete-Time
Processing of Speech Signals. IEEE Press, 2000.
[4] Tomohiro Nakatani and Masato Miyoshi. Blind dereverberation of single
channel speech signal based on harmonic structure. IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1:92-95,
2003.
[5] Jont Allen and David Berkley. Image method for efficiently simulating
small-room acoustics. Journal of the Acoustical Society of America, pages
912-915, 1979.
[6] Stephen G. McGovern. A simple model for room acoustics
(http://www.stevem.us/rir.html). Web page.
[7] Mingyang Wu and DeLiang Wang. A one-microphone algorithm for rever-
berant speech enhancement. IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 1:844-847, 2003.
[8] Alan V. Oppenheim and Ronald W. Schafer. Digital Signal Processing.
Prentice-Hall, 1975.
[9] Masato Miyoshi and Yutaka Kaneda. Inverse filtering of room acoustics.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2):145-
152, February 1988.
[10] Jacob Benesty, Shoji Makino, and Jingdong Chen. Speech Enhancement.
Springer, 2005.
[11] Martin Heckmann, Frank Joublin, and Edgar Körner. Sound source sepa-
ration for a robot based on pitch. In IROS, 2005 (accepted).
[12] Tomohiro Nakatani, Keisuke Kinoshita, Masato Miyoshi, and Parham S.
Zolfaghari. Harmonicity based blind dereverberation with time warping.
Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual
Audio Processing (SAPA), October 2004.
[13] B. Yegnanarayana and P. Satyanarayana Murthy. Enhancement of rever-
berant speech using LP residual signal. IEEE Transactions on Speech and
Audio Processing, 8(3):267, May 2000.
[14] Bradford W. Gillespie, Henrique S. Malvar, and Dinei A. F. Florencio.
Speech dereverberation via maximum-kurtosis subband adaptive filtering.
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2001.
[15] Marc Delcroix, Takafumi Hikichi, and Masato Miyoshi. Dereverberation of
speech signals based on linear prediction. INTERSPEECH - ICSLP, 2004.
[16] Nikolay D. Gaubitch, Patrick A. Naylor, and Darren B. Ward. On the use of
linear prediction for dereverberation of speech. In International Workshop
on Acoustic Echo and Noise Control (IWAENC), 2003.
[17] Simon Haykin. Adaptive Filter Theory. Prentice Hall, 2001.
[18] Henrique Malvar. A modulated complex lapped transform and its applica-
tions to audio processing. Technical Report MSR-TR-99-27, Microsoft Re-
search, May 1999.
[19] James Robert Hopgood. Nonstationary Signal Processing with Application to
Reverberation Cancellation in Acoustic Environments. PhD thesis, Queens'
College, University of Cambridge, 2000.
[20] L. Tong, G. Xu, and T. Kailath. A new approach to blind identification
and equalization of multipath channels. Proc. 25th Asilomar Conf. (Pacific
Grove, CA), pages 856-860, 1991.
[21] Sharon Gannot and Marc Moonen. Subspace methods for multimicrophone
speech dereverberation. Technical report, CCIT Report 398, Technion -
Israel Institute of Technology, Haifa, 2002.
[22] Yiteng Huang and Jacob Benesty. A class of frequency-domain adaptive ap-
proaches to blind multichannel identification. IEEE Transactions on Signal
Processing, 51:11-24, January 2003.
[23] Zhu Liang Yu and Meng Hwa Er. Blind multichannel identification for
speech dereverberation and enhancement. IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 4:105-108, 2004.
[24] Zhu Liang Yu and Meng Hwa Er. A robust adaptive blind multichannel iden-
tification algorithm for acoustic applications. IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 2:25-28, 2004.
[25] Takafumi Hikichi, Marc Delcroix, and Masato Miyoshi. Blind dereverbera-
tion based on estimates of signal transmission channels without precise infor-
mation on channel order. IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 1:1069-1072, 2005.