
Technische Universität Darmstadt

Institut für Automatisierungstechnik
Fachgebiet Regelungstheorie und Robotik
Prof. Dr.-Ing. Jürgen Adamy
Landgraf-Georg-Str. 4
D-64283 Darmstadt

Diplomarbeit
Study of blind dereverberation algorithms for real-time
applications
Xavier Domont
Work in cooperation with:
Honda Research Institute Europe GmbH
D-63073 Offenbach/Main

Tutors:
Dr.-Ing. Martin Heckmann (HRI)
Dipl.-Ing. Bjoern Scholing (TUD)
June 2005

Abstract
At Honda Research Institute Europe, an automatic speech recognition system is
being developed for the humanoid robot ASIMO. The reverberation effect alters the
perception of speech signals emitted in a room and reduces the performance of
automatic speech recognition. Many methods have been proposed over the past
few decades to enhance reverberant speech signals. This diploma thesis studies
the most promising algorithms and discusses whether they can be implemented in real-time for real environments.
The existing methods can be classified into two families:
1. Those that directly estimate the clean speech signal and treat reverberation
as a disturbance.
2. Those that estimate the room impulse response and invert the estimated
system to recover the clean speech.
These two approaches are compared in this thesis, based on MATLAB implementations
of selected algorithms. The comparison focuses on the
suitability of these algorithms for real environments, where speaker and robot
are moving, and on a possible real-time implementation.

Kurzfassung
Am Honda Research Institute Europe wird ein automatisches Spracherkennungssystem für den Roboter ASIMO entwickelt. Hall stört die Sprachqualität
und senkt deutlich die Ergebnisse bei der Spracherkennung. Seit 30 Jahren sind
viele Methoden vorgeschlagen worden, um Sprachsignale zu verbessern. Diese
Diplomarbeit untersucht die aussichtsreichsten Algorithmen im Hinblick auf
Echtzeitfähigkeit und Anwendbarkeit unter realen Bedingungen.
Es gibt zwei Ansätze, dieses Problem zu lösen:
1. Das originale Sprachsignal kann direkt aus dem beobachteten Signal geschätzt
werden. Der Halleffekt wird als Störung des reinen Signals angenommen.
2. Die Raumimpulsantwort kann bestimmt werden und wird anschließend invertiert, um das originale Sprachsignal zu bekommen.
Diese zwei Ansätze werden in dieser Diplomarbeit verglichen. Dafür werden ausgewählte Algorithmen implementiert. Der Hauptpunkt des Vergleichs ist die
Untersuchung der Methoden auf Einsetzbarkeit in echten Umgebungen, in denen
sich Sprecher und Roboter bewegen.

Contents

1 Introduction
   1.1 What is blind dereverberation?
   1.2 Motivation of this diploma thesis
   1.3 Audio processing architecture on ASIMO
       1.3.1 Overview of the peripheral auditory system
       1.3.2 The Gammatone filterbank, a model of the basilar membrane
   1.4 Overview of this report

2 Model of a reverberant signal
   2.1 Properties of a speech signal
       2.1.1 Quick overview of the speech production system
       2.1.2 Speech segments categorization
       2.1.3 Harmonicity of a speech signal
       2.1.4 Linear prediction analysis
   2.2 Room acoustics
       2.2.1 Measurement of real room impulse responses
       2.2.2 Simulation of the room impulse response
       2.2.3 Linear Time-Invariant model of the room
       2.2.4 Effect on the spectrogram
   2.3 Inversion of the room impulse response
       2.3.1 Conditions on the inversion of FIR filters
       2.3.2 Are room transfer functions minimum-phase?
       2.3.3 Multiple input inverse filter

3 Enhancement of a speech signal
   3.1 Harmonicity based dereverberation
       3.1.1 Effect of reverberation on a sweep signal
       3.1.2 Adaptive harmonic filtering
       3.1.3 Dereverberation operator
       3.1.4 The HERB method
       3.1.5 Test of the method
       3.1.6 Discussion of the method
   3.2 Dereverberation using LP analysis
       3.2.1 Problem formulation
       3.2.2 The kurtosis as measure of the reverberation
       3.2.3 Maximization of the kurtosis
   3.3 Discussion of the method

4 Equalization of room impulse responses
   4.1 Principle of the channel estimation
       4.1.1 Hypothesis
       4.1.2 Basic idea
       4.1.3 How can this idea be implemented?
       4.1.4 Why do the channels have to be coprime?
       4.1.5 Estimation of the length of the filters
   4.2 Batch method
       4.2.1 Extraction of the common part
       4.2.2 Noisy case
   4.3 Iterative method
       4.3.1 Choice of the optimization method
       4.3.2 Simulation
   4.4 Improvement of the method
   4.5 Discussion of the channel estimation methods

5 Conclusion and outlook
   5.1 Review of the studied methods
       5.1.1 Harmonicity-based dereverberation
       5.1.2 Linear prediction analysis
       5.1.3 Channel estimation
       5.1.4 Direct comparison of the methods
   5.2 Speech model based method vs. channel estimation
   5.3 What should we decide for ASIMO?

A Proofs

List of Figures
1.1  Different paths of a sound wave in a room
1.2  General model of a reverberant signal
1.3  General shape of a room impulse response
1.4  Peripheral auditory system [1]
1.5  Impulse and frequency responses of a Gammatone filter
1.6  Analysis filters of a Gammatone filter-bank with 16 channels
2.1  General model of a reverberant signal
2.2  Schematic diagram of the human speech production mechanism (source: [3])
2.3  Block diagram of the human speech production (source: [3])
2.4  Discrete-time speech production model. (a) True model. (b) Model to be estimated using LP analysis. (source: [3])
2.5  System to identify (one-microphone case)
2.6  Measurement method
2.7  Example of a measured room impulse response
2.8  Image method: direct path
2.9  Image method: virtual source
2.10 Image method: sound wave reflecting off two walls
2.11 Room impulse response simulated with the image method
2.12 Spectrograms of an anechoic signal (left) and of its convolution with the impulse response of figure 2.7 (right), obtained with a Gammatone filterbank
2.13 Inversion of a filter
2.14 Pole and zero of an all-pass filter
2.15 Energy of a non-minimum-phase system (dashed, blue) and of the corresponding minimum-phase system (red)
2.16 Multiple input inverse filter
3.1  Spectrograms of a sweeping sinusoid and its reverberant signal
3.2  Adaptive harmonic filtering
3.3  Diagram of the HERB dereverberation method
3.4  Top left: original signal (sweep with harmonics). Top right: reverberant signal. Bottom left: harmonic estimate with the Gammatone filter-bank. Bottom right: harmonic estimate with Nakatani's harmonic filter
3.5  Spectrograms of the clean and reverberant signals used to test the dereverberation operator
3.6  Spectrogram of the enhanced signal computed in the frequency domain
3.7  Impulse response of the dereverberation operator and spectrogram of the enhanced signal computed in the time domain
3.8  Effect of the reverberation on the fundamental frequency
3.9  Example of platykurtic (left) and leptokurtic (right) distributions; both have the same standard deviation
3.10 Left: extract of the LP residual of a speech signal (note the strong peaks corresponding to the glottal pulses). Right: the same signal impaired by reverberation
3.11 Estimated probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red); both signals have been centered and normalized to zero mean and unit standard deviation
3.12 (a) Single-channel time-domain adaptive algorithm for maximizing the kurtosis of the LP residuals. (b) Equivalent system that avoids LP reconstruction artifacts
3.13 Two-channel frequency-domain adaptive algorithm for maximizing the kurtosis of the LP residual
3.14 Left: LP residual of a clean signal. Right: LP residual of the dereverberated signal; its kurtosis is higher than that of the original signal, but the resulting signal is strongly distorted
4.1  SIMO system
4.2  Channel identification with overestimated channel orders
4.3  Estimated and real zeros for one channel (left) and zeros of all the estimated channels (right); four additional estimated zeros do not correspond to real zeros of the filter and are common to all the estimated channels
4.4  Eigenvalues of the matrix Rx in the noiseless case; on the right, a zoom on the smallest eigenvalues
4.5  Left: 4 of the 11 eigenvectors of the null space. Right: common part of the null space (blue) and real impulse response (red); the impulse responses of the 2 channels are concatenated and 10 zeros (corresponding to the over-estimation of the order) were added
4.6  Eigenvalues of the correlation matrix in the noisy case, with a noise variance of 10^(-10) (left) and 10^(-6) (right)
4.7  Iterative estimation of the channel impulse responses using two microphones. Left: estimated zeros (blue) of one of the channels compared with their real values (red). Right: remaining impulse response after inversion of the system (blue); in the ideal case it should be a Dirac (red)
4.8  Iterative estimation of the channel impulse responses using 5 microphones. Left: estimated zeros (blue) of one of the channels compared with their real values (red). Right: remaining impulse response after inversion of the system (blue); in the ideal case it should be a Dirac (red)
4.9  Comparison of the position of the zeros when the convolution and the subsampling are performed in a different order

Acronyms
FFT Fast Fourier Transform
DFT Discrete Fourier Transform
STFT Short-Time Fourier Transform
SISO Single-Input Single-Output
SIMO Single-Input Multiple-Output
MIMO Multiple-Input Multiple-Output
ROC Region of Convergence
LTI Linear Time-Invariant
FIR Finite Impulse Response
IIR Infinite Impulse Response
MINT Multiple input inverse filter

Chapter 1
Introduction
1.1  What is blind dereverberation?

Figure 1.1: Different paths of a sound wave in a room

The acoustic signals emitted in a room reflect off the walls and other objects
(see figure 1.1). The direct signal and all the reflected sound waves arrive at
the microphone or listener with different delays and sum up. This effect is called
reverberation. Sometimes the term echo is used instead of reverberation; however,
echo generally implies a distinct, delayed copy of a sound. In a room, each
delayed sound wave arrives in such a short period of time that we do not perceive
each reflection as a copy of the original sound. Even though we can't discern


every reflection, we still hear the effect of the entire series of reflections.
Whereas a human being without hearing problems can cope quite well with these
distortions, the reverberation effect impairs speech intelligibility in devices
such as hands-free conference telephones and automatic speech recognition systems.
The diagram in figure 1.2 shows how the system can be modeled. The effect of
the room is considered as a filter with impulse response h(t) whose input is the
clean speech signal s(t) and the output is the observed reverberant signal x(t).
s(t) → h(t) → x(t)
Figure 1.2: General model of a reverberant signal

Figure 1.3 shows the general shape of a room impulse response. The reverberation corrupts the speech by blurring its temporal structure. However, due to the
spectral continuity of speech, the early reflections mainly increase the intensity of
the reverberant speech, whereas the later ones are deleterious to speech quality
and intelligibility.

Figure 1.3: General shape of a room impulse response

The aim of blind dereverberation is to recover the clean signal s(t) from the
observed reverberant signal x(t). The term "blind" means that neither the clean
signal nor the impulse response of the room is known before the processing.
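The convolutive model above can be sketched numerically. The signal, sampling rate, and toy impulse response below are made-up illustration values, not measurements from this thesis:

```python
import numpy as np

# Hypothetical illustration of the reverberation model x(t) = h(t) * s(t):
# a clean signal convolved with a decaying room impulse response.
rng = np.random.default_rng(0)

fs = 8000                                # assumed sampling rate in Hz
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * t)          # "clean" signal: a 440 Hz tone

# Toy impulse response: direct path plus exponentially decaying reflections.
h = np.zeros(2000)
h[0] = 1.0                               # direct path
tail = rng.standard_normal(1999) * np.exp(-np.arange(1999) / 400)
h[1:] = 0.3 * tail                       # late reflections

x = np.convolve(s, h)                    # reverberant (observed) signal

# The observed signal is longer than the clean one by len(h) - 1 samples.
print(len(x) - len(s))                   # -> 1999
```

In the blind setting, only x would be available; both s and h above are hidden from the algorithm.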

1.2  Motivation of this diploma thesis

This diploma thesis was written in cooperation with Honda Research Institute
(HRI) Europe. One of the important projects of HRI is the development of the


humanoid robot ASIMO (Advanced Step in Innovative MObility). At HRI Europe
the CARL (Child-like Acquisition of Representation and Language) group of Dr.
Frank Joublin aims at developing a system for automatic speech recognition (ASR)
and production for ASIMO. As the distortions caused by reverberation degrade
the performance of ASR, we will investigate whether a signal processing method can be
found to dereverberate the signals heard by ASIMO.
During the past decades many dereverberation methods have been proposed.
However, no standard method has yet emerged and this research topic is still
very active. The aim of this diploma thesis is to establish the state of the art of the
existing methods and then to evaluate whether some of them could be integrated into the
audio processing system of ASIMO.
The important requirements for ASIMO are, firstly, that the dereverberation is
processed in real-time and, secondly, that the system adapts to a real and
changing environment. This means that the algorithms have to adapt to
the room conditions faster than these conditions change. As both ASIMO and
the speaker can be moving, the effects of the room can change very rapidly.
To perform this study we selected, out of the recently proposed methods, the ones
that seemed most promising. The selected methods were then implemented in MATLAB in order to determine their advantages and drawbacks.
For the implementation we tried, whenever possible, to use the existing audio
processing architecture of ASIMO, described in section 1.3.
In addition to the analysis of their performance, it will be discussed whether the studied
methods, while enhancing the perception of speech, alter signal characteristics that are used by subsequent audio processing on ASIMO. In particular,
the phase spectrum is essential for the localization of a speech source.

1.3  Audio processing architecture on ASIMO

The audio processing system at HRI uses a Gammatone filterbank. This type
of filterbank is widely used in audio signal processing as it simulates the human
auditory system.


Figure 1.4: Peripheral auditory system [1]

1.3.1  Overview of the peripheral auditory system

The aim of the peripheral auditory system (see figure 1.4) is to transform a sound
(which is actually a pressure variation in air) into nerve impulses. These impulses
are then conveyed by the auditory nerve to the brain stem. The nerve cells in
the brain stem act as relay stations, eventually conveying nerve impulses to the
auditory cortex.
The outer ear is composed of the pinna (the visible part) and the auditory canal
or meatus. The pinna significantly modifies the incoming sound in a way that
depends on the angle of incidence of the sound relative to the head. This is
important for sound localization. Sound travels down the meatus and causes
the eardrum, or tympanic membrane, to vibrate. These vibrations are transmitted
through the middle ear by three small bones, the ossicles, to a membrane-covered
opening in the bony wall of the spiral-shaped structure of the inner ear, the
cochlea.
The cochlea is shaped like the spiral shell of a snail. It is filled with almost incompressible fluids and is divided along its length by two membranes, Reissner's
membrane and the basilar membrane. The motion of the basilar membrane
in response to a sound is of primary interest.

1.3.2  The Gammatone filterbank, a model of the basilar membrane

A point on the basilar membrane is characterized by its impulse response. The
Gammatone function approximates physiologically recorded impulse responses:

g(t) = t^(n−1) · exp(−2πbt) · cos(2πf0·t + φ)        (1.1)

where t is the time (t ≥ 0), b determines the duration of the impulse response, n
is the order of the filter and determines the slope of the skirts of the filter, φ is a
phase and f0 is the center frequency.

Figure 1.5: Impulse and frequency responses of a Gammatone filter

It can be observed from figure 1.5 that the Gammatone filter is a bandpass filter with
center frequency f0. Its bandwidth depends on b.
To simulate the whole basilar membrane, a bank of Gammatone filters can be
used. Each filter channel represents the frequency response of one point on the
basilar membrane.
The parameters of the Gammatone filters are determined from psychoacoustic
measurements. Glasberg and Moore [2] characterized the equivalent rectangular
bandwidth (ERB) of the human auditory filter. The ERB of a filter is defined as
the width of a rectangular filter whose height equals the peak gain of the filter
and which passes the same total power as the filter.
The relation between the bandwidth and the center frequency of the Gammatone
filters is given by:

ERB = 24.7 + 0.108 f0.        (1.2)


Figure 1.6 shows the transfer functions of a bank of 16 filters with center
frequencies spaced between 50 Hz and 8 kHz. As the spectral resolution of the
basilar membrane decreases as the frequency increases, the center frequencies of
the Gammatone filters are not linearly distributed and their bandwidths increase
with the center frequency according to equation (1.2). We can also note that the
pass bands overlap.

Figure 1.6: Analysis filters of a Gammatone filter-bank with 16 channels.
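The relations above can be sketched in code. Equation (1.2) gives the ERB; the bandwidth parameter b = 1.019·ERB and the logarithmic (ERB-rate) spacing of the 16 center frequencies are common choices borrowed from the gammatone literature, i.e. assumptions not stated in this chapter:

```python
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth in Hz, equation (1.2)."""
    return 24.7 + 0.108 * f0

def gammatone_ir(f0, fs, n=4, duration=0.05, phase=0.0):
    """Gammatone impulse response of equation (1.1).
    Tying b to the ERB via b = 1.019 * ERB(f0) is an assumption here."""
    b = 1.019 * erb(f0)
    t = np.arange(int(duration * fs)) / fs
    return t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f0 * t + phase)

def erb_space(f_low, f_high, num):
    """Center frequencies equally spaced on the ERB-rate scale
    (the exact spacing rule is an assumption; the text only says non-linear)."""
    c = 24.7 / 0.108                     # ~228.7 Hz, from ERB = 24.7 + 0.108 f
    e_low, e_high = np.log(1 + f_low / c), np.log(1 + f_high / c)
    return c * (np.exp(np.linspace(e_low, e_high, num)) - 1)

cfs = erb_space(50.0, 8000.0, 16)        # 16 channels between 50 Hz and 8 kHz
print(np.round(cfs[0], 1), np.round(cfs[-1], 1))   # endpoints of the spacing
```

The spacing reproduces the qualitative behavior of figure 1.6: center frequencies crowd together at low frequencies and spread out, with growing bandwidths, toward 8 kHz.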

1.4  Overview of this report

The existing blind dereverberation methods can be classified into two families.

1. We can directly estimate the clean speech signal, or the parameters and
excitation of an appropriate parametric model, as a missing-data problem,
treating reverberation as a disturbance.
2. We can model the effect of the room by a filter. The coefficients of this filter
are estimated by treating the clean speech as a disturbance. The observed
signal is then deconvolved with the estimated filter to recover the clean speech.
In chapter 2 we will discuss how speech signals and room impulse responses can
be modeled. This modeling step is essential to determine what, in the observed
signal, is due to the speech and what is an effect of the room.


In chapter 3 two methods which use the properties of speech to enhance reverberant signals will be studied. These methods consider the room effect
as a disturbance and try to restore the characteristics of the speech that the
reverberation altered.
In chapter 4 the possibility of estimating the room impulse responses will be
discussed. This approach is very interesting because, once the effect of the room on
the signal is known, it becomes possible to invert this effect and recover the clean
signal.

Chapter 2
Model of a reverberant signal
In terms of signal processing a room can be seen as a filter. The original (anechoic)
signal s(n) goes through a filter h(n) and gives the reverberant signal x(n), see
figure 2.1. In the case of blind dereverberation both the input signal s(n) and the room
impulse response h(n) are unknown.
s(n) → h(n) → x(n)
Figure 2.1: General model of a reverberant signal

The task of dereverberation is to find an estimate ŝ(n) of s(n), given the output
x(n) of the system. In order to make this task feasible, a model of the speech
signal and/or a model of the room are required.
In section 2.1 different ways to model a speech signal will be discussed. In section
2.2 the effects of the room on the speech signal will be investigated. Finally, section
2.3 will discuss the possibility of inverting the effects of the room.

2.1  Properties of a speech signal

2.1.1  Quick overview of the speech production system

The principal components of the human speech production system are (see figure
2.2) the lungs, trachea (windpipe), larynx (organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). The
pharyngeal and oral cavities are usually grouped together and referred to as the vocal tract.

Figure 2.2: Schematic diagram of the human speech production mechanism (source:
[3])

It is useful to think of speech production in terms of an acoustic filtering operation. The pharyngeal, oral and nasal cavities comprise the main acoustic filter.
This filter is excited by the organs below it, and is loaded at its main output by
a radiation impedance due to the lips. The articulators are used to change the
properties of the system, its form of excitation, and its output loading over time.
Figure 2.3 shows a simplified acoustic model illustrating these ideas.


Figure 2.3: Block diagram of the human speech production (source: [3])

2.1.2  Speech segments categorization

The spectral characteristics of the speech wave are non-stationary, since the physical system changes rapidly over time. Speech can therefore be divided into sound
segments that present similar properties over a short period of time. Without
going into further detail, the main way to classify a speech sound is by the type
of excitation.
The two elementary types of excitation are voiced and unvoiced. There are actually a few other types of excitation (mixed, plosive, whisper, silence) but they can
be seen as combinations of the two elementary types.
Voiced sounds are produced by forcing air through the glottis, an opening between
the vocal folds. The vocal cords vibrate in an oscillatory fashion and, therefore, the
produced speech signal is quasi-periodic; its period is called the fundamental period
T0, and the fundamental frequency is defined as F0 = 1/T0.


Unvoiced sounds are generated by forming a constriction at some point along the
vocal tract, and forcing air through the constriction to produce turbulence. The
produced speech signal is a noise-like sound.
Typical human speech communication is limited to a bandwidth of 7-8 kHz. The
main part of the energy is contained in voiced segments.

2.1.3  Harmonicity of a speech signal

A speech signal s(n) can be modeled [4] as the sum of a harmonic signal
sh(n), derived from the glottal vibration, and a non-harmonic signal sn(n), such as
fricatives and plosives:

s(n) = sh(n) + sn(n).        (2.1)

The harmonic part of the signal is defined by its voiced durations and their fundamental frequencies (F0 ). A voiced duration is the time during which the vocal
cords vibrate to generate a harmonic signal and the fundamental frequency refers
to the frequency of the fundamental component of the signal. Each harmonic
component has a frequency which corresponds to F0 or its multiples.
It can be assumed that F0 is constant within a short time. The harmonic
signal sh(n) can therefore be modeled over a time frame of length T by a sum of sinusoidal components whose frequencies coincide with the fundamental frequency of
the signal and its multiples:

sh(n) = Σ_{k=1}^{N} Ak · cos(2π·k·F0·(n − nc)/fs + φk),   for |n − nc| < T/2,        (2.2)

where Ak and φk are the amplitude and the phase of the k-th harmonic component, nc is the time index of the center of the frame and fs is the sampling rate.

2.1.4  Linear prediction analysis

A widely used model of speech signals is given by Linear Prediction (LP) analysis.
This model consists of separating the speech signal into an excitation signal and
a model of the vocal tract.


During a stationary frame of speech the model would ideally be characterized by
a pole-zero transfer function of the form

Θ(z) = Θ0 · (1 + Σ_{i=1}^{L} b(i)·z^(−i)) / (1 − Σ_{i=1}^{R} a(i)·z^(−i))        (2.3)

which is driven by an excitation sequence

e(n) = Σ_{q=−∞}^{+∞} δ(n − qP)   in the voiced case,
e(n) = zero-mean, unit-variance uncorrelated noise   in the unvoiced case,        (2.4)

where P is the fundamental period in samples and δ(n) is the discrete Dirac impulse

δ(n) = 1 if n = 0,  0 else.        (2.5)
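The two excitation types of equation (2.4) can be sketched as follows; frame length and pitch period are arbitrary illustration values:

```python
import numpy as np

# Sketch of the LP excitation model of equations (2.4)-(2.5): a periodic
# impulse train for voiced speech, white noise for unvoiced speech.
rng = np.random.default_rng(0)
N = 400               # frame length in samples (assumption)
P = 80                # fundamental period in samples (assumption)

# Voiced case: sum of shifted discrete Dirac impulses delta(n - qP).
e_voiced = np.zeros(N)
e_voiced[::P] = 1.0

# Unvoiced case: zero-mean, unit-variance uncorrelated noise.
e_unvoiced = rng.standard_normal(N)

print(int(e_voiced.sum()))               # -> 5 impulses in 400 samples
```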

The principle of the LP analysis is to approximate this pole-zero system with an
all-pole system

Θ̂(z) = 1 / (1 − Σ_{i=1}^{R} â(i)·z^(−i))        (2.6)

which can easily be estimated by solving a system of linear equations. The
schematics of the true speech model and of its LP approximation are shown in
figure 2.4.
A magnitude spectrum, but not a phase spectrum¹, can be exactly modeled with
stable poles. This means that the LP analysis models the true magnitude
spectrum of the speech, which is, in most cases, enough for speech perception.
For example, a listener moving from room to room within a house is able to clearly
understand the speech of a stationary talker, even if the phase relationships among
the components change dramatically [3]. However, for some applications, like
the localization of the talker, the temporal dynamics of the sound are essential
and the LP analysis should be used with care.

¹ Actually the LP model has a minimum-phase characteristic. This notion will be discussed
in more detail in section 2.3.


Figure 2.4: Discrete-Time speech production model. (a) True Model. (b) Model to
be estimated using LP analysis. (source [3])

To understand the name "Linear Prediction", it is helpful to consider the LP
analysis in the time domain. An all-pole transfer function corresponds to an
autoregressive (AR) model, i.e. the signal s(n) can be expressed as a linear combination of its L past samples:

s(n) = Σ_{k=1}^{L} ak · s(n − k) + e(n)        (2.7)

where ak are the LP coefficients. The excitation signal e(n) can be seen in terms
of system identification as the prediction error signal, also called LP residual.
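The autocorrelation (normal-equation) solution mentioned above and the residual of equation (2.7) can be sketched as follows; the synthetic AR(2) test signal and the model order are assumptions for illustration, not speech data:

```python
import numpy as np

# Sketch of LP analysis via the autocorrelation method: the all-pole
# coefficients solve a linear system, and the prediction error of
# equation (2.7) is the LP residual.
rng = np.random.default_rng(1)

def lp_coefficients(s, order):
    """Solve the normal equations R a = r for the LP coefficients a_k."""
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(s, a):
    """e(n) = s(n) - sum_k a_k s(n - k), the prediction error."""
    e = s.copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * s[:-k]
    return e

# Test on a synthetic AR(2) signal: s(n) = 1.3 s(n-1) - 0.4 s(n-2) + noise.
e_true = rng.standard_normal(5000)
s = np.zeros(5000)
for m in range(2, 5000):
    s[m] = 1.3 * s[m - 1] - 0.4 * s[m - 2] + e_true[m]

a = lp_coefficients(s, 2)
print(np.round(a, 2))                    # close to [1.3, -0.4]
```

On such an AR signal, the residual variance drops back to roughly the variance of the driving noise, which is the property exploited by the kurtosis-based method of section 3.2.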

2.2  Room acoustics

This section first presents how room impulse responses can be measured
(2.2.1) or simulated (2.2.2). The goal is to obtain a set of real and artificial
impulse responses; these data will be used in the next chapters to test the
dereverberation methods.
Secondly (2.2.3), we discuss whether a general model of a room can be found.
Finally (2.2.4), we use time-frequency analysis to briefly study the effects of
reverberation on speech signals.

2.2.1  Measurement of real room impulse responses

In order to obtain real impulse responses corresponding to a normal room, we performed measurements in the office of the CARL group at HRI. A sound signal
was played in the room through a loudspeaker. Simultaneously the sound wave
was recorded using a model of ASIMO's head equipped with two microphones.

s(n) → h(n) → x(n)

Figure 2.5: System to identify (one microphone case)

For each microphone both the input s(n) and the output x(n) of the system are
known; the impulse response h(n) can then be computed by inverting the
convolution

x(n) = h(n) ∗ s(n).        (2.8)

However, the measurement is generally altered by additive noise. To improve the
measurement it is therefore better to use auto- and cross-correlation functions.
Equation (2.8) becomes

Rsx(n) = h(n) ∗ Rs(n)        (2.9)

where Rs(n) is the autocorrelation function of s(n) and Rsx(n) the cross-correlation function of s(n) and x(n). Equation (2.9) is less sensitive to noise.
Moreover, if s(n) is a white noise, its autocorrelation function is equal to δ(n).
Then

h(n) = Rsx(n),  when Rs(n) = δ(n).        (2.10)

The impulse response of the room is then equal to the cross-correlation function of the
white noise played by the loudspeaker and the signal recorded by the microphone.
For our measurement we used 1 second of Gaussian white noise as room excitation
signal.
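A minimal numerical sketch of equations (2.8)-(2.10); the toy three-tap "room" and the normalization by the noise power are illustration choices, not the actual measurement setup described above:

```python
import numpy as np

# With white-noise excitation, the cross-correlation of input and output
# recovers h(n), as in equations (2.8)-(2.10).
rng = np.random.default_rng(0)

fs = 8000
s = rng.standard_normal(fs)              # 1 s of Gaussian white noise
h = np.zeros(64)
h[0], h[20], h[45] = 1.0, 0.5, 0.25      # toy room impulse response (made up)
x = np.convolve(s, h)                    # observed signal, equation (2.8)

# Cross-correlation R_sx(n) = sum_m s(m) x(m + n), normalized by the
# excitation power so that the estimate is on the scale of h(n).
n_lags = len(h)
Rsx = np.array([np.dot(s, x[k:k + len(s)]) for k in range(n_lags)]) / np.dot(s, s)

print(np.round(Rsx[[0, 20, 45]], 2))     # close to the true taps [1.0, 0.5, 0.25]
```

The residual error at the other lags shrinks as the excitation gets longer, which is why a full second of noise was used for the real measurement.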


Two different sound cards were used to play and record the signals. In order to
easily synchronize the input and the output of the system, the excitation signal
was directly recorded on another channel of the capture sound card in addition
to the signals of the microphones (see figure 2.6).

Figure 2.6: Measurement method

Moreover, this method makes it possible to compensate for possible effects of the sound cards.
As only a stereo sound card was available, the recordings for the left and right ears
had to be performed separately. Figure 2.7 shows one of the measured impulse
responses.

Figure 2.7: Example of measured room impulse response

2.2.2  Simulation of the room impulse response

A technique to simulate the impulse response of a room is the image method,
proposed in 1979 by Allen [5]. It sums the direct path with all the reflections off walls
and objects.
or objects.
An example in [6] illustrates the principle of this method. Figure 2.8 shows the direct
path from a sound source to a microphone. Another part of the sound wave

2.2 Room acoustics

33

Figure 2.8: Image Method: Direct path

is reflected off a wall and then impinges upon the microphone. This reverberated
sound seems to come directly from a virtual source located in an adjacent room,
symmetrical to the original room relatively to the wall (see figure 2.9). On this
figure the black line represents the real path of the signal, whereas the blue line
is its perceived path.
Figure 2.9: Image Method: virtual source

This process can be extended to sound waves that are reflected more than once off
the walls (see figure 2.10).

Figure 2.10: Image Method: Sound wave reflecting off two walls

This process can be continued in the same way in three dimensions to obtain an
infinite number of virtual sources.


The virtual sources make it easy to compute the distance the sound wave travels
to reach the microphone.
Considering a rectangular room with dimensions (Lx, Ly, Lz), the coordinate
vector ri,j,k = (xi, yj, zk)^T, (i, j, k) ∈ Z³, of a virtual source is

xi = (−1)^i xsource + ( i + (1 − (−1)^i)/2 ) Lx
yj = (−1)^j ysource + ( j + (1 − (−1)^j)/2 ) Ly        (2.11)
zk = (−1)^k zsource + ( k + (1 − (−1)^k)/2 ) Lz

where rsource = (xsource, ysource, zsource)^T is the coordinate vector of the
source. The distance from the virtual source to the microphone is

di,j,k = ||ri,j,k − rm|| = sqrt( (xi − xm)² + (yj − ym)² + (zk − zm)² )        (2.12)

where rm = (xm, ym, zm)^T is the coordinate vector of the microphone.
The sound wave corresponding to the (i, j, k)-th virtual source arrives at the
microphone with a delay

τi,j,k = di,j,k / c        (2.13)

where c is the speed of sound. The impulse response of the room is the sum of the
delayed impulses corresponding to the signals arriving from each virtual source:

h(t) = Σ_{i,j,k ∈ Z} hi,j,k δ(t − di,j,k / c)        (2.14)

The magnitude hi,j,k of each unit impulse is determined by the distance the sound
wave travels to get from the source to the microphone,

bi,j,k = 1 / (4π d²i,j,k)        (2.15)

and by the number of reflections off the walls,

ci,j,k = β^(|i|+|j|+|k|)        (2.16)

where β < 1 is the wall reflection coefficient (which is, in this simple model,
considered to be the same for all the walls). The magnitude is then

hi,j,k = bi,j,k ci,j,k        (2.17)


Although the impulse response of the room should contain an infinite number of
delayed impulses, corresponding to an infinity of virtual sources, the magnitudes
hi,j,k become very small for large |i|, |j| and |k|. The impulse response can
therefore be truncated to a finite duration:

h(t) = Σ_{i=−n}^{n} Σ_{j=−n}^{n} Σ_{k=−n}^{n} hi,j,k δ(t − di,j,k / c)        (2.18)

Figure 2.11: Room impulse response simulated with the image method

Figure 2.11 shows a simulated room impulse response obtained with the image
method. Reverberant sounds generated using such an impulse response sound like
signals recorded in real conditions. However, phenomena such as the phase
inversion of the sound wave when it reflects off a wall, or the presence of
objects and people, are ignored by this model.

2.2.3 Linear Time-Invariant model of the room

The general shape of the measured and the simulated room impulse responses
corresponds to the one described in figure 1.3. However, when the conditions in
the room change (movement of the talker and/or listener), the coefficients of the
impulse response fluctuate strongly, especially in the late reverberation tail.
As explained in chapter 1, the distortions in the speech signal are mostly due to
the late reverberation. Therefore, a model based on the image method, where the
room impulse response is modeled by a sum of delayed impulses hi,j,k δ(t − τi,j,k),
is not practical for system identification.
Actually, the only general properties which can be retained from a room impulse
response are its linearity, its causality (there is no reverberation before the
beginning of the signal) and its overall exponential decay.
In real environments, the talker or the listener move, so the effect of the room
is time-variant. However, if we assume that the computation is fast enough
compared to these movements, the system can be considered Linear Time-Invariant
(LTI).
Moreover, because of the exponential decay, the impulse response of the room has
a finite duration. The room is then modeled by a Finite Impulse Response (FIR)
filter. The relation between the input s(n) and the output x(n) is given by the
convolution

x(n) = h(n) * s(n) = Σ_{k=0}^{L−1} h(k) s(n − k),        (2.19)

where L is the length of the impulse response (also called order of the channel).
Actually, the FIR model of the room impulse response is very practical as the
transfer function of the system, i.e. the z-transform of its impulse response,

H(z) = Σ_{k=0}^{L−1} h(k) z^{−k}        (2.20)

is defined for all finite z and is a polynomial.
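Equation (2.19) is exactly what a standard discrete convolution routine computes.
A small numerical check, using an arbitrary toy impulse response and input made
up for the illustration:

```python
import numpy as np

h = np.array([1.0, 0.5, 0.25])            # toy FIR room impulse response, L = 3
s = np.array([1.0, -1.0, 2.0, 0.0, 1.0])  # toy input signal

x = np.convolve(h, s)                     # x(n) = h(n) * s(n)

# the same result written out explicitly as the sum of eq. (2.19)
x_ref = np.zeros(len(h) + len(s) - 1)
for n in range(len(x_ref)):
    for k in range(len(h)):
        if 0 <= n - k < len(s):
            x_ref[n] += h[k] * s[n - k]

assert np.allclose(x, x_ref)
```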

2.2.4 Effect on the spectrogram

It is interesting to study the effect of reverberation on the spectrogram² of a
speech signal. Figure 2.12 shows the spectrograms of the same speech signal
without and with reverberation.
The problem can be explained in the time-frequency domain as follows: given the
spectrogram of the original signal at time frame t and frequency f, S(t, f),
what is the influence of the room on the spectrogram of the reverberant signal
at time-frequency bin (t′, f′), X(t′, f′)?
The value of X at time frame t′ is only affected by bins of the original signal
that lie between time frames t′ and t′ − D, where D depends on the reverberation

² Instead of the spectrogram, the Gammatone filter-bank described in chapter 1
can be used. Contrary to a normal spectrogram, computed using a short-time
Fourier transform, the Gammatone filter-bank gives an output for each time sample
(no subsampling) and the center frequencies of the filters are not linearly
distributed, which is closer to the human auditory system.


Figure 2.12: Spectrograms of an anechoic signal (left) and the resulting
spectrogram of its convolution with the impulse response of figure 2.7 (right).
These spectrograms were obtained with a Gammatone filter-bank.

time of the room, i.e. the time for the sound to die away to a level 60 dB below
its original level. In the frequency domain, the reverberation slightly affects
the adjacent channels. According to [7], this effect has the form of a Laplace
distribution.

2.3 Inversion of the room impulse response

In this section the theoretical possibility of a perfect dereverberation is
discussed. The issue can be formulated in the following way: assuming that the
room impulse response is known, is it possible to remove its effect and to get an
accurate estimate of the original speech signal?

2.3.1 Conditions on the inversion of FIR filters

The inverse g(n) of a filter h(n) (see figure 2.13) satisfies

ŝ(n) = g(n) * x(n)
     = g(n) * h(n) * s(n)        (2.21)
     = s(n),

which can be simplified to

g(n) * h(n) = δ(n).        (2.22)


Figure 2.13: Inversion of a filter

The inversion problem can be studied with the help of the z-transform. The
z-transform of h(n), also called the transfer function of the filter, is defined
as the power series

H(z) = Σ_{k=−∞}^{+∞} h(k) z^{−k}        (2.23)

It was shown in section 2.2 that the room can be considered as an FIR filter. Its
z-transform is then a polynomial

H(z) = h0 + h1 z^{−1} + . . . + hL−1 z^{−(L−1)}        (2.24)

where L is the length of the room impulse response. It can be factored as

H(z) = h0 z^{−(L−1)} (z − z1)(z − z2) · · · (z − zL−1)        (2.25)

so H(z) has L − 1 finite zeros at z = z1, z2, . . . , zL−1.


The transfer function of the inverse filter is then the rational function

G(z) = 1 / (h0 + h1 z^{−1} + . . . + hL−1 z^{−(L−1)})        (2.26)

The Infinite Impulse Response (IIR) filter G(z) is causal and stable if and only
if all its poles are inside the unit circle (|z| = 1). As the poles of G(z) are
the zeros of H(z), this means that all the zeros of H(z) must be inside the unit
circle. Such a system is called minimum-phase.
In order to understand this problem, we can observe what happens if we try to
invert a simple non-minimum-phase system. Consider the FIR filter h(n), defined
in the time domain by

h(n) = δ(n) − 2δ(n − 1).

Its transfer function is

H(z) = 1 − 2z^{−1}.


The Region of Convergence (ROC) of this z-transform is |z| > 0. As this system
has a zero at z = 2, it is non minimum-phase.
The transfer function of its inverse system is

G(z) = 1 / (1 − 2z^{−1}) = z / (z − 2).

G(z) has a zero at the origin and a pole at z = 2. In this case there are two
possible regions of convergence and hence two possible inverse systems. If the
ROC of G(z) is taken as |z| > 2, then

g(n) = 2^n u(n),

where u(n) is the unit step function

u(n) = 1 if n ≥ 0, and 0 otherwise.        (2.27)
This is the impulse response of a causal but unstable system. On the other hand,
if the ROC is assumed to be |z| < 2, the impulse response of the inverse system
is

g(n) = −2^n u(−n − 1).

In this case the inverse system is anti-causal and stable.
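This toy example can be verified numerically. The sketch below truncates both
candidate inverses: the causal one g(n) = 2^n u(n) grows without bound, while the
truncated anti-causal one reproduces δ(n) up to a small truncation error at the
left edge.

```python
import numpy as np

h = np.array([1.0, -2.0])                  # h(n) = delta(n) - 2 delta(n-1)

# causal inverse g(n) = 2^n u(n): the coefficients grow without bound
g_causal = 2.0 ** np.arange(30)
assert g_causal[-1] > 1e8                  # clearly unstable

# anti-causal inverse g(n) = -2^n u(-n-1), truncated to n = -30 .. -1
n = np.arange(-30, 0)
g_anti = -(2.0 ** n)

# convolving h with the truncated anti-causal inverse gives delta(n),
# up to a residual of magnitude 2^-30 at the left edge of the support
d = np.convolve(h, g_anti)                 # supported on n = -30 .. 0
assert abs(d[-1] - 1.0) < 1e-8             # the n = 0 sample is ~1
assert np.max(np.abs(d[:-1])) < 1e-8       # everything else is ~0
```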

2.3.2 Are room transfer functions minimum-phase?

Any system can be represented as the cascade of a minimum-phase system with an
all-pass system [8]. An all-pass system is defined as a system for which the
magnitude of the transfer function is unity for all frequencies. Thus if Hap(z)
denotes the z-transform of an all-pass system, |Hap(e^{jω})| = 1 for all ω.
The poles and zeros of an all-pass system occur at conjugate reciprocal locations
(see figure 2.14).
Consider a non-minimum-phase system H(z) with, for example, one zero outside the
unit circle at z = 1/z0*, |z0| < 1, and the remainder of its poles and zeros
inside the unit circle. Then H(z) can be expressed as

H(z) = H1(z)(z^{−1} − z0*)        (2.28)



Figure 2.14: Pole and zero of an all-pass filter

where H1(z) is minimum-phase. Equivalently, equation (2.28) can be written as

H(z) = H1(z)(z^{−1} − z0*) · (1 − z0 z^{−1}) / (1 − z0 z^{−1})
     = H1(z)(1 − z0 z^{−1}) · (z^{−1} − z0*) / (1 − z0 z^{−1})        (2.29)
     = Hmin(z) (z^{−1} − z0*) / (1 − z0 z^{−1})
     = Hmin(z) Hap(z)


where Hmin(z) is minimum-phase and Hap(z) is all-pass. Any pole or zero of H(z)
that is inside the unit circle also appears in Hmin(z). Any pole or zero of H(z)
that is outside the unit circle appears in Hmin(z) at the conjugate reciprocal
location. The equivalent minimum-phase system has the same magnitude spectrum as
the original system.
It is interesting to compare the impulse response h(n) of an FIR system with the
impulse response hmin(n) of its equivalent minimum-phase system.
Figure 2.15 shows that the energy of hmin(n) is more concentrated around the
origin. This property can be formalized with the following inequality³:

Σ_{n=0}^{m} |h(n)|² ≤ Σ_{n=0}^{m} |hmin(n)|²,  for all m ∈ N.        (2.30)

The energy of both sequences is the same since the magnitudes of their Fourier
transforms are the same (by Parseval's theorem). This means that equality holds
in (2.30) as m → ∞.

³ A proof of this property is outlined in [8], page 371.

Figure 2.15: Energy of a non-minimum-phase system (dashed, blue) and the
corresponding minimum-phase system (red).
Room transfer functions often have more energy in the reverberant component of
the room impulse response than in the component corresponding to the direct path
(see figure 1.3). This implies that room transfer functions are often
non-minimum-phase. A causal and stable inverse of a room impulse response is
therefore in general impossible to find. The non-causality problem can be solved
by introducing a delay, i.e. a delayed inverse filter is computed instead.
However, the delay generally has to be quite long, which is not satisfactory for
real-time applications.

2.3.3 Multiple input inverse filter

As room transfer functions are most of the time non-minimum-phase, a perfect
dereverberation cannot be achieved with a single microphone. It is possible to
find a delayed inverse filter, but this solution is not really adequate for
real-time processing.
However, it is possible to find the exact inverse for a point in the room by
using multiple microphones⁴, if the room transfer functions corresponding to the
different sensors are coprime, i.e. they do not share common zeros [9].
This property is actually a direct application of Bezout's theorem on
polynomials. Given M FIR filters with transfer functions Hi(z), i = 1, . . . , M:
if the Hi(z) are coprime polynomials, then there exist Gi(z), i = 1, . . . , M,
such that

H1(z)G1(z) + H2(z)G2(z) + . . . + HM(z)GM(z) = 1        (2.31)

where the orders of the Gi(z) are smaller than or equal to the highest order of
the Hi(z). Figure 2.16 shows how equation (2.31) can be used to invert the M
channels simultaneously. This method is called the Multiple-input/output INverse
Theorem (MINT).
Figure 2.16: Multiple input inverse filter

By using more than one microphone, the issue that room transfer functions are
non-minimum-phase is bypassed. Moreover, the inverse filters are simple FIR
filters, which can be computed by solving the linear system

d = [H1^T H2^T · · · HM^T] g = Hg        (2.32)

where d = [1, 0, . . . , 0]^T is a vector of length 2L − 1, g is the
concatenation of the vectors gi = [gi(0), . . . , gi(L − 1)]^T corresponding to
the inverse filters,

g = [g1^T . . . gM^T]^T        (2.33)

and the Hi are the L × (2L − 1) Sylvester matrices corresponding to the
polynomials Hi(z):

Hi = [ hi(0) · · · hi(L−1)    0      · · ·    0
        0     hi(0)  · · ·  hi(L−1)  · · ·    0        (2.34)
        · · ·
        0     · · ·    0     hi(0)  · · ·  hi(L−1) ]

⁴ Such a system is called Single-Input Multiple-Output (SIMO)


A Sylvester matrix makes it possible to compute a convolution (or a polynomial
multiplication) as a matrix multiplication. Given two signals x(n) and y(n), of
lengths Lx and Ly respectively, their convolution z(n) has Lx + Ly − 1 samples
and can be written in vector form as

z = X^T y = Y^T x,        (2.35)

where X (resp. Y) is the Ly × (Lx + Ly − 1) (resp. Lx × (Lx + Ly − 1)) Sylvester
matrix of x(n) (resp. y(n)), and y (resp. x) is the signal y(n) (resp. x(n))
written as a column vector of length Ly (resp. Lx).
The linear system of equation (2.32) can be solved by computing the
Moore-Penrose pseudo-inverse⁵ of the matrix H, denoted H⁺. The inverse filter is
then computed as

g = H⁺ d.        (2.36)

As d = [1, 0, . . . , 0]^T, g is actually the first column of H⁺. The linear
system of equation (2.32) has infinitely many solutions; the pseudo-inverse
method gives the solution with the smallest norm ||g||₂.

⁵ The Moore-Penrose pseudo-inverse is a matrix H⁺ of the same dimensions as H^T
satisfying four conditions: HH⁺H = H, H⁺HH⁺ = H⁺, and HH⁺ and H⁺H are Hermitian.
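The MINT computation of equations (2.32)–(2.36) fits in a few lines. The sketch
below uses two toy channels of length L = 3 (made up for the example and chosen
without common zeros), builds their Sylvester matrices, and checks that the
resulting filters satisfy the Bezout identity (2.31), i.e. that the equalized
system is a pure impulse.

```python
import numpy as np

def sylvester(h, n_cols):
    """L x n_cols Sylvester matrix of h: row r holds h shifted right by r."""
    L = len(h)
    H = np.zeros((L, n_cols))
    for r in range(L):
        H[r, r:r + L] = h
    return H

h1 = np.array([1.0, -0.5, 0.3])   # toy channel impulse responses, L = 3,
h2 = np.array([1.0, 0.8, 0.2])    # chosen with no common zeros (coprime)
L = len(h1)

H = np.hstack([sylvester(h1, 2 * L - 1).T, sylvester(h2, 2 * L - 1).T])
d = np.zeros(2 * L - 1)
d[0] = 1.0                        # target: a pure impulse, eq. (2.32)
g = np.linalg.pinv(H) @ d         # minimum-norm solution, eq. (2.36)
g1, g2 = g[:L], g[L:]

# Bezout identity (2.31): h1 * g1 + h2 * g2 = delta(n)
eq = np.convolve(h1, g1) + np.convolve(h2, g2)
assert np.allclose(eq, d, atol=1e-8)
```

Note that, unlike the single-channel case, the equalizers g1 and g2 are plain FIR
filters: no delay and no approximation is needed when the channels are coprime.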

Chapter 3
Enhancement of a speech signal
Reverberation produces a distortion that alters the intelligibility of speech. A
possible approach to the dereverberation problem is to consider the general properties of a speech signal, which are degraded by the reverberation.
A simple way to improve the reverberant signal is, for example, to detect the
reverberation tails between words. By removing or attenuating these parts, which
contain only reverberation, the listening comfort is slightly improved. However,
this method, which is used in hearing aids, does not remove the distortion that
alters the words themselves.
The two methods presented in this chapter use, more or less explicitly, the
harmonicity property of the voiced segments of a speech signal in order to
recover the clean signal. In section 3.1 the approach of Nakatani [4], using an
adaptive harmonic filter, is described. In section 3.2 an adaptive algorithm
working on the LP residual of the speech signal is presented.

3.1 Harmonicity based dereverberation

Nakatani et al. propose in [4] an interesting single-microphone dereverberation
method called Harmonicity based dEReverBeration (HERB). This method is based on
the harmonicity model of speech described in section 2.1.
The principle is to estimate a dereverberation operator using the harmonic parts
of the speech signal. This operator, initially designed for the harmonic parts,
is expected to work on the non-harmonic parts as well.


Figure 3.1: Spectrograms of a sweeping sinusoid and its reverberant signal.

The performance of this method, presented in [4] and [10], is impressive. In this
section we begin by describing the principle of the dereverberation process.
Then we discuss its applicability on ASIMO.

3.1.1 Effect of reverberation on a sweep signal

In order to understand the basic idea of the HERB method, it is useful to observe
the effect of reverberation on a sweeping sinusoid. In discrete time the
sinusoidal sweep is defined by

s(n) = A sin( 2π ( (k/2)(n/fs)² + fstart (n/fs) ) )        (3.1)

where A is the amplitude, fstart is the frequency at t = 0, fs is the sampling
frequency, and k a constant. Its instantaneous frequency varies linearly in time:

ν(n) = k (n/fs) + fstart.        (3.2)
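Such a sweep can be generated directly from equations (3.1)–(3.2); the parameter
values below are arbitrary examples matching the 100–4000 Hz range of figure 3.1.

```python
import numpy as np

fs = 16000                         # sampling frequency (example value)
f_start, f_end = 100.0, 4000.0     # sweep range as in figure 3.1
dur = 0.5                          # duration in seconds
k = (f_end - f_start) / dur        # sweep rate in Hz/s

t = np.arange(int(dur * fs)) / fs
# s(n) = A sin(2 pi (k/2 (n/fs)^2 + f_start n/fs)), eq. (3.1), with A = 1
s = np.sin(2 * np.pi * (0.5 * k * t ** 2 + f_start * t))
# instantaneous frequency nu(n) = k n/fs + f_start, eq. (3.2)
nu = k * t + f_start
```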

Figure 3.1 (left) shows the spectrogram of a half-second-long discrete signal
whose frequency sweeps from 100 to 4000 Hz. This spectrogram is obtained using a
Gammatone filter-bank, therefore the frequency scale is not linear (see (1.2)).
This signal is then convolved with the impulse response shown in figure 2.7. The
resulting spectrogram is shown in figure 3.1 (right). We can observe from this
spectrogram that the sinusoidal component corresponding to the original signal
can be clearly identified. In each frequency band, the energy corresponding to this


direct signal appears first and is followed by a reverberation tail. At a given
point in time, the energy of the signal is maximum for the frequency
corresponding to the direct signal.
The idea of the HERB method is to track the instantaneous frequency ν̃(l) of the
dominant sinusoidal component in the reverberant signal at each short time frame.
The amplitude Ã(l) and phase φ̃(l) of this dominant sinusoid are extracted and
used to synthesize the signal

s̃(n) = Σ_l g(n − nl) Ã(l) cos( 2π ν̃(l) n/fs + φ̃(l) ),        (3.3)

where g(n − nl) is a window function for overlap-add synthesis and nl is the time
index centered in frame l.

3.1.2 Adaptive harmonic filtering

Although a sweep signal contains only one dominant sinusoid, a harmonic signal
contains several sinusoidal components whose frequencies correspond to its
fundamental frequency F0 and its multiples (cf. section 2.1). The aim of a
harmonic filter is to enhance these components. Since the fundamental frequency
of a speech signal changes over time, the properties of the filter have to be
adaptively modified according to F0 (see figure 3.2).
Figure 3.2: Adaptive harmonic filtering

A simple approach to harmonic filtering is the comb filter, defined as 1 + z^{−τ},
where τ is the period to be enhanced. The method proposed by Nakatani in [4] is
to filter the signal by synthesizing a harmonic sound as follows:
1. The fundamental frequency of the observed signal is estimated at each time
frame. If the time frame is short enough, this fundamental frequency can be
considered constant.



2. The amplitudes and phases of the individual harmonic components are estimated
using the short-time Fourier transform (STFT), X(l, m), of x(n):

X(l, m) = Σ_n g1(n − nl) x(n) e^{−2πjm(n−nl)/M},        (3.4)

Ak,l = |X(l, [kF0,l])|,        (3.5)

φk,l = ∠X(l, [kF0,l]),        (3.6)

where l is the index of the time frame, nl is the time index corresponding to the
center of the frame, m is the index of the frequency bin, M is the number of
points used for the Discrete Fourier Transform (DFT), Ak,l and φk,l are
respectively the estimated amplitude and phase of the k-th harmonic component,
F0,l is the fundamental frequency of the time frame, g1(n) is an analysis window
function and [·] discretizes a continuous frequency into the index of the nearest
frequency bin.
3. The output of the filter, x̃(n), is obtained by adding sinusoids,

x̃l(n) = Σ_k Ak,l cos( 2π kF0,l (n − nl)/fs + φk,l ),        (3.7)

and combining them over succeeding frames,

x̃(n) = Σ_l g2(n − (nl + lT)) x̃l(n),        (3.8)

where x̃l(n) is a synthesized harmonic sound corresponding to the time frame l,
T is the frame shift in samples and g2(n) is a synthesis window function.
Actually, the harmonic filter itself is easy to implement. The main issue is to
find an accurate estimate of the fundamental frequency of the signal, even in the
case of strong reverberation.
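The three steps above can be sketched as follows, assuming the fundamental
frequency of each frame is already known (the pitch estimation itself, which is
the hard part, is not shown; window choices and amplitude scaling are
simplifications made for the illustration):

```python
import numpy as np

def harmonic_filter(x, f0_per_frame, fs, M=512, T=128):
    """Enhance harmonic components by analysis/resynthesis (steps 1-3)."""
    g1 = np.hanning(M)                      # analysis window g1(n)
    g2 = np.hanning(M)                      # synthesis window g2(n)
    y = np.zeros(len(x))
    norm = np.zeros(len(x))
    n = np.arange(M)
    for l, f0 in enumerate(f0_per_frame):   # F0 assumed known per frame (step 1)
        start = l * T
        frame = x[start:start + M]
        if len(frame) < M or f0 <= 0:
            continue
        X = np.fft.rfft(g1 * frame)         # STFT of the frame, eq. (3.4)
        xl = np.zeros(M)
        for k in range(1, int((fs / 2) // f0) + 1):
            m = int(round(k * f0 * M / fs))      # nearest frequency bin [.]
            if m >= len(X):
                break
            A = 2.0 * np.abs(X[m]) / g1.sum()    # amplitude A_{k,l}, eq. (3.5)
            phi = np.angle(X[m])                 # phase phi_{k,l}, eq. (3.6)
            xl += A * np.cos(2 * np.pi * k * f0 * n / fs + phi)  # eq. (3.7)
        y[start:start + M] += g2 * xl       # overlap-add, eq. (3.8)
        norm[start:start + M] += g2
    return y / np.maximum(norm, 1e-12)

# example: a pure 250 Hz tone whose F0 is assumed known in every frame
fs = 16000
t = np.arange(8000) / fs
x = 0.8 * np.sin(2 * np.pi * 250.0 * t)
y = harmonic_filter(x, [250.0] * 59, fs)
```

On this clean example the resynthesized signal closely follows the input; on a
reverberant signal, only the energy at the harmonic frequencies survives.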

3.1.3 Dereverberation operator

Harmonic case
The dereverberation operator is computed in the frequency domain using the
short-time Fourier transform. Let X(l, m) be the STFT of a reverberant signal.


X(l, m) can be represented as the product of the source signal, S(l, m), and the
room transfer function, H(m), which is assumed to be time-invariant (cf. section
2.2). This transfer function can be divided into two functions, D(m) and R(m).
The former corresponds to the direct signal, D(m)S(l, m), and the latter to the
reverberant part, R(m)S(l, m):

X(l, m) = H(m)S(l, m) = D(m)S(l, m) + R(m)S(l, m)        (3.9)

The aim of the dereverberation operator is to estimate the direct signal
X′(l, m) = D(m)S(l, m). It can be obtained by subtracting the reverberant part
R(m)S(l, m) in equation (3.9), or by finding the inverse filter

W(m) = D(m) / H(m).        (3.10)

Then

X′(l, m) = W(m)X(l, m) = (D(m)/H(m)) H(m)S(l, m) = D(m)S(l, m).        (3.11)
The basic idea of the HERB method is the following: if S(l, m) is a harmonic
signal, the direct signal contained in X(l, m) can be obtained using an adaptive
harmonic filter. At each time frame l an inverse filter W0(l, m) is computed in
the frequency domain using the output X̃(l, m) of the harmonic filter:

W0(l, m) = X̃(l, m) / X(l, m)        (3.12)

As the signal X̃(l, m) is supposed to contain only the direct part of the signal
X(l, m), this filter removes the reverberation in the time frame.
As the effect of the room is supposed to be constant, the dereverberation
operator W(m) is estimated by averaging the inverse filters computed at the
different time frames:

W(m) = E{W0(l, m)}        (3.13)


General case
This process can be applied to a speech signal S(l, m) by rewriting equation
(2.1) in the frequency domain:

S(l, m) = Sh(l, m) + Sn(l, m)        (3.14)

where Sh(l, m) is the harmonic part and Sn(l, m) is the non-harmonic part.
The observed reverberant signal X(l, m) is rewritten as

X(l, m) = D(m)Sh(l, m) + (R(m)Sh(l, m) + H(m)Sn(l, m))        (3.15)

where H(m) = D(m) + R(m) is the transfer function of the room.
The component D(m)Sh(l, m) can be approximately extracted from X(l, m) with an
adaptive harmonic filter. This approximated direct signal X̃(l, m) can be modeled
as

X̃(l, m) = D(m)Sh(l, m) + R̃h(l, m) + H̃n(l, m)        (3.16)

where R̃h(l, m) is a part of the reverberation of Sh(l, m) and H̃n(l, m) is a
part of the direct signal and reverberation of Sn(l, m). It can be assumed, if
the fundamental frequency is perfectly estimated, that the only estimation errors
on X̃(l, m) are caused by these two unexpected remaining parts.
The dereverberation operator is computed as in the harmonic case (3.12):

W(m) = E{W0(l, m)} = E{ X̃(l, m) / X(l, m) }        (3.17)
Using equation (3.16), the operator over a time frame, W0(l, m), can be rewritten
as (the time and frequency indices have been omitted for clarity):

W0 = X̃/X = (DSh + R̃h + H̃n) / (HS)        (3.18)

   = ((D + R̃h/Sh) Sh) / (H(Sh + Sn)) + ((H̃n/Sn) Sn) / (H(Sh + Sn))        (3.19)

   = (D + R̃h/Sh)/H · 1/(1 + Sn/Sh) + (H̃n/Sn) · 1/H · 1/(1 + Sh/Sn)        (3.20)

It can be proven (see appendix A) that the expected value of a function
1/(1 + z), where z is a complex random variable, is equal to the probability that
|z| < 1, if it is assumed that the phase of z is uniformly distributed, the phase
of z and its absolute value are statistically independent, and |z| ≠ 1.
Then, under the following conditions [10]:
1. The phase of Sn(l, m) and a joint event composed of Sh(l, m), R̃h(l, m) and
|Sn(l, m)| are independent,
2. The phase of Sh(l, m) and a joint event composed of |Sh(l, m)|, H̃n(l, m) and
Sn(l, m) are independent,
3. The phases of Sh(l, m) and Sn(l, m) are uniformly distributed within [0, 2π),
4. |Sh(l, m)| ≠ |Sn(l, m)|,
equation (3.17) can be derived as

W(m) ≈ (D(m) + R̄(m)) / H(m) · P{|Sh(l, m)| > |Sn(l, m)|},        (3.21)

where

R̄(m) = Et{ R̃h(l, m) / Sh(l, m) }_{|Sh(l,m)|>|Sn(l,m)|},        (3.22)

P{·} is a probability function, and Et{·}_A represents an average over time
frames under the condition that A holds.
In the derivation of W(m), the term corresponding to the non-harmonic part is
neglected. Indeed it is expected that, when |Sn(l, m)| > |Sh(l, m)|, i.e. when
the signal is non-harmonic over the time frame, the output of the harmonic filter
is equal to zero, so that

Et{ H̃n(l, m) / Sn(l, m) }_{|Sn(l,m)|>|Sh(l,m)|} ≈ 0.        (3.23)

Given equation (3.21), W(m) is expected to remove the reverberation not only of
the harmonic components of the speech signal but also of the non-harmonic ones.
It approximates the inverse filter D(m)/H(m) except for a remaining reverberation
due to R̄(m).
The enhanced signal is then expected to be the sum of the direct signal and a
reduced reverberation part. However, because of the term
P{|Sh(l, m)| > |Sn(l, m)|}, the gain of the filter becomes zero in frequency
regions where no harmonic components were included during the estimation of the
dereverberation operator.
In addition, even in frequency regions in which harmonic components were present
during the estimation phase, the filter gain is expected to decrease as the
frequency increases. The reason is that in a speech signal the energy of the
higher-order harmonic components is smaller than the energy of the sinusoidal
components at the fundamental frequency and its first multiples.

3.1.4 The HERB method

Figure 3.3 summarizes the dereverberation process using the HERB method.

Figure 3.3: Diagram of the HERB dereverberation method

The system consists of the following sub-procedures:


1. Estimation of the fundamental frequency and the voiced durations.
2. Harmonic filtering.
3. Estimation of the dereverberation operator.
4. Dereverberation of the signal using this operator.
The estimation of the fundamental frequency is the most important point of the
method. F0 must be robustly estimated in order to achieve a good dereverberation.
This task is difficult in the presence of strong reverberation. Nakatani [10]
proposes to repeat the dereverberation process of figure 3.3 three times:


STEP 1: The dereverberation process is applied to the observed reverberant
signal.
STEP 2: The dereverberated signal obtained in the first step is used to estimate
the fundamental frequency. This new F0 is used to control the adaptive harmonic
filter, but the input of this filter is the original reverberant signal.
STEP 3: The dereverberated signal obtained in the second step is now used as the
new reverberant signal, which is enhanced by reapplying the whole dereverberation
process to it.
According to [10], the quality of the dereverberated signal improves at each
step. The second step can be repeated several times to further improve the
estimation of the fundamental frequency. By contrast, the quality of the signal
does not always improve when the third step is repeated; this is due to an
accumulation of errors in the dereverberation filters.

3.1.5 Test of the method

The performance of this method, described in [4] and [10], is impressive. We have
implemented this method to test two points:
1. Can this method easily be adapted to replace the STFT by a Gammatone
filter-bank?
2. Can the process work in real time and in real environments?
For HRI the interest of this method resides in the fact that the fundamental
frequency of the signal is already computed for other processes. A pitch-tracking
algorithm has already been developed at HRI [11] and can valuably be used to
estimate the fundamental frequency of the signals. As the computation of the
fundamental frequency seems to be the critical point of the HERB algorithm, we
expected a lot from this method.

Harmonic Filter
The aim of this section is to compare the harmonic filter proposed by Nakatani in
[4] and an implementation of the harmonic filter on the Gammatone filter-bank.


The implementation of the harmonic filter on the filter-bank is simple. The
outputs of the Gammatone filter-bank are N signals corresponding to the different
frequency channels of the cochlea response. Knowing the fundamental frequency,
the frequency channels corresponding to F0 and its multiples are determined. At
each time sample t the signals of these channels are added.
The resulting spectrograms (see figure 3.4) show that both implementations of the
harmonic filter give similar results.

Figure 3.4: Top left: original signal (sweep with harmonics). Top right:
reverberant signal. Bottom left: harmonic estimate with the Gammatone
filter-bank. Bottom right: harmonic estimate with Nakatani's harmonic filter.

As expected, the adaptive harmonic filter can be implemented without problems on
the Gammatone filter-bank. Moreover, the assumption that the fundamental
frequency is constant over a short time frame is no longer required, as our
filter-bank performs the harmonic filtering without the subsampling imposed by
the STFT used in [4]. The improvement of the method using time warping, proposed
in [12], is unnecessary with a Gammatone implementation.


Dereverberation operator
In order to compute the dereverberation operator, a training sequence is
required. During this adaptation phase the room impulse response must not change.
In the simulation the training sequence was composed of several sweeping
sinusoids with their harmonics, similar to the one shown in figure 3.4. The
operator is computed using a short-time Fourier transform. The restriction on the
time window is very strong: it has to be long enough to contain a whole word
(sweep) including the complete reverberation tail.
It is important to note here that the windows of the short-time Fourier transform
for the harmonic filter and for the estimation of the dereverberation operator
cannot be the same. For the harmonic filtering the analysis window must be as
short as possible in order to respect the assumption that the fundamental
frequency of the signal is constant. On the other hand, for the estimation of the
filter W(m), the time window of the STFT must be several seconds long.
It is also assumed that, during the adaptation phase of the dereverberation
filter, the pauses between the words are long enough. Otherwise the reverberation
tail of a word would alter the following word. Under these conditions the
dereverberation operator can be computed.
In order to estimate the performance of the algorithm, 500 random harmonic
signals (sweeps with harmonics) of 0.5 s each are used as training data. These
signals are convolved with the room impulse response shown in figure 2.7. As the
exact fundamental frequencies of these signals are known, a good estimation of
the dereverberation operator can be expected. This dereverberation operator is
then used to enhance a real speech signal convolved with the same room impulse
response (see figure 3.5).
Figure 3.6 shows the spectrogram of the enhanced signal. It is here important
to note that the speech signal used for the test contains only one word. As the
time window of the STFT used to compute the dereverberation operator is long
enough to contain a whole word, the enhanced signal is obtained by multiplying
the FFT of the whole reverberant signal with the dereverberation operator.
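The enhancement step described above can be sketched as follows in Python/NumPy (the thesis experiments were done in Matlab). The sketch builds a per-frequency-bin least-squares dereverberation operator from a set of training signals and applies it by multiplying the FFT of the whole reverberant signal with it; the toy impulse response, the FFT length, and the least-squares averaging are illustrative assumptions, not the exact operator estimation used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy room impulse response and a set of "training words" (random signals).
h = rng.normal(size=64) * np.exp(-np.arange(64) / 8.0)
N = 1024  # FFT length, long enough for a word plus the reverberation tail

num = np.zeros(N, dtype=complex)
den = np.zeros(N)
for _ in range(50):
    s = rng.normal(size=256)                  # clean training "word"
    x = np.convolve(s, h)[:N]                 # reverberant observation
    S, X = np.fft.fft(s, N), np.fft.fft(x, N)
    # Accumulate a least-squares estimate of the operator per frequency bin.
    num += S * np.conj(X)
    den += np.abs(X) ** 2
W = num / den

# Enhance a new reverberant signal by multiplying its FFT with W.
s_test = rng.normal(size=256)
x_test = np.convolve(s_test, h)[:N]
s_hat = np.real(np.fft.ifft(W * np.fft.fft(x_test, N)))[:256]

err = np.linalg.norm(s_hat - s_test) / np.linalg.norm(s_test)
```

In this noiseless toy setting the estimated operator converges to 1/H(m) and the reconstruction is essentially exact; with real speech, where the harmonic filter must first estimate the clean spectrum, the operator is far less accurate.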
The dereverberation works relatively well. However, we can see on the spectrogram that the dereverberation filter is not causal. This is not surprising since, as explained in section 2.3, room impulse responses are in general non-minimum-phase. Because of this non-causality the beginning of the signal is altered. This can be a problem, as this part of the signal normally contains no reverberation and can, therefore, be valuable for further processing.

Figure 3.5: Spectrogram of the clean and the reverberant signal used to test the dereverberation operator.

Figure 3.6: Spectrogram of the enhanced signal computed in the frequency domain.
We also try to express the dereverberation filter in the time domain, using the inverse Fourier transform of the operator computed in the frequency domain. In this case the dereverberation does not work anymore (see figure 3.7).

Figure 3.7: Impulse response of the dereverberation operator and spectrogram of the
enhanced signal computed in the time domain.

3.1.6 Discussion of the method

This method can theoretically remove very strong reverberation from a signal.


Moreover, as the dereverberation operator is computed in the frequency domain,
the computational cost is quite low.
However, the amount of required adaptation data is prohibitive (about 500 words). Furthermore, the pauses between words have to be very long in the case of long reverberation. It is therefore impossible to use this method in a real-time application: it would be quite bothersome to have to speak to ASIMO for several minutes before it begins to understand what is said, and this assumes that neither the robot nor the speaker moves during this time.
Given the restrictions on the adaptation phase of the algorithm, this method cannot be applied in a real environment. In addition, a remark can be made on the use of harmonic filtering to remove reverberation, even for a highly harmonic signal. The harmonic filter manages quite well to remove the reverberation when the fundamental frequency changes within the word. However, a large part of the reverberation remains if the fundamental frequency changes too slowly. Figure 3.8

shows the effect of the reverberation on the fundamental frequency in such a case.

Figure 3.8: Effect of the reverberation on the fundamental frequency.

In this example the component corresponding to the fundamental frequency is strongly disturbed by the reverberation. Increasing the frequency resolution (the number of channels of the filter-bank) solves this problem. But more than 1000 channels are required to get a frequency selectivity as good as that of the human auditory system. Due to the increased computational load this is not feasible for real-time applications.

3.2 Dereverberation using LP analysis

Dereverberation using the harmonicity of the signal requires too much training data. Therefore, in this section, another dereverberation method will be discussed. This method uses the autoregressive (AR) model of speech signals. Several methods based on linear prediction (LP) analysis have been proposed [13], [14], [15].

3.2.1 Problem formulation

In section 2.1.4 it was explained that a speech signal s(n) can be expressed as a linear combination of its L past sample values. The clean and the reverberant speech become, respectively,

s(n) = \sum_{k=1}^{L} a_k s(n-k) + e_s(n),    (3.24)

x(n) = \sum_{k=1}^{L} b_k x(n-k) + e_x(n),    (3.25)

where a_k and b_k are the LP coefficients and e_s(n) and e_x(n) the LP residual signals (or prediction errors).
The important assumption on which the dereverberation methods using LP analysis are based is that the LP coefficients are unaffected by the reverberation:

b_k = a_k \quad \forall k \in [1, L] \subset \mathbb{N}.    (3.26)

Actually this assumption holds only in a spatially averaged sense [16], i.e. using several microphones:

E\{b_k\} = a_k \quad \forall k \in [1, L] \subset \mathbb{N}.    (3.27)

Consequently the dereverberation process will try to enhance the LP residual of the signal, whose structure is well known (see 2.1.4). The aim of the dereverberation methods using LP analysis is to improve the LP residual signal such that \hat{e}(n) \approx e_s(n). Then a clean speech estimate is obtained by

\hat{s}(n) = \sum_{k=1}^{L} b_k \hat{s}(n-k) + \hat{e}(n),    (3.28)

i.e. the estimated LP coefficients obtained by linear prediction analysis are used to synthesize a signal out of the enhanced excitation signal \hat{e}(n).
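The analysis-synthesis chain of equations (3.24)-(3.28) can be sketched as follows (Python/NumPy rather than the Matlab used in the thesis; the `lpc` helper, the AR(10) toy signal and the sparse pulse-train excitation are illustrative assumptions):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LP coefficients a_1..a_p (prediction form)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])  # x(n) ~ sum_k a_k x(n-k)

p = 10
# Synthetic AR(10) "speech" driven by a sparse, glottal-pulse-like excitation.
a_true = np.zeros(p); a_true[0] = 0.8; a_true[1] = -0.3
e = np.zeros(4000); e[::80] = 1.0
s = np.zeros(4000)
for n in range(4000):
    s[n] = e[n] + sum(a_true[k] * s[n - 1 - k] for k in range(p) if n - 1 - k >= 0)

a = lpc(s, p)
# Analysis: residual e(n) = s(n) - sum_k a_k s(n-k)
res = s.copy()
for k in range(p):
    res[k + 1:] -= a[k] * s[:-(k + 1)]
# Synthesis (eq. 3.28): rebuild the signal from the residual
s_rec = np.zeros_like(s)
for n in range(len(s)):
    s_rec[n] = res[n] + sum(a[k] * s_rec[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
```

The synthesis filter inverts the analysis filter exactly, so `s_rec` reproduces `s`; a dereverberation method of this family would replace `res` by an enhanced residual before the synthesis step.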

3.2.2 The kurtosis as a measure of the reverberation

Figure 3.9: Example of platykurtic (left) and leptokurtic (right) distributions. Both distributions have the same standard deviation.

Gillespie shows in [14] that the kurtosis of the LP residual is a valid reverberation metric. The kurtosis \beta_2 of a random signal x(n) is the degree of peakedness of its distribution, defined as the fourth central moment \mu_4 normalized by the fourth power of the standard deviation (or the square of the variance):

\beta_2 = \frac{\mu_4}{\sigma^4} = \frac{E\{(x(n) - \mu)^4\}}{E\{(x(n) - \mu)^2\}^2}    (3.29)

where \mu = E\{x(n)\} is the mean value of x(n).
As the kurtosis of a normal distribution is equal to 3, the kurtosis excess, denoted \gamma_2 and defined by

\gamma_2 = \frac{\mu_4}{\sigma^4} - 3,    (3.30)

is often used. A distribution with a high peak (\gamma_2 > 0) is called leptokurtic, a flat-topped curve (\gamma_2 < 0) is called platykurtic, and the normal distribution is called mesokurtic.
Figure 3.9 illustrates the kurtosis measure. The distribution on the right is more peaked at the center, so one might conclude that it has a lower standard deviation. On the other hand, it has thicker tails, which usually implies a higher standard deviation. If the effect of the peakedness exactly offsets that of the thick tails, the two distributions have the same standard deviation.
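A minimal numerical check of these definitions (Python/NumPy sketch; the three sample distributions are arbitrary choices made for illustration):

```python
import numpy as np

def kurtosis_excess(x):
    """gamma_2 = mu_4 / sigma^4 - 3, cf. eqs. (3.29)-(3.30)."""
    xc = x - np.mean(x)
    return np.mean(xc ** 4) / np.mean(xc ** 2) ** 2 - 3.0

rng = np.random.default_rng(2)
gauss = rng.normal(size=200_000)             # mesokurtic:  gamma_2 ~ 0
laplace = rng.laplace(size=200_000)          # leptokurtic: gamma_2 ~ 3
uniform = rng.uniform(-1, 1, size=200_000)   # platykurtic: gamma_2 = -1.2
```

The Laplace distribution, with its sharp peak and heavy tails, is a rough stand-in for the distribution of a clean LP residual, while a flatter distribution mimics the time-smeared reverberant residual.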
For clean voiced speech, the LP residual has strong peaks corresponding to glottal pulses (see figure 3.10), whereas for reverberant speech such peaks are spread in time. In figure 3.11, the probability density functions of a clean signal and of the convolution of this signal with the room impulse response computed in the CARL Group's office (see figure 2.7) are estimated. Both signals have been centered and normalized such that their means equal 0 and their standard
Figure 3.10: On the left, an extract of the LP residual of a speech signal. Note the strong peaks corresponding to the glottal pulses. On the right, the same signal impaired by reverberation.

Figure 3.11: Estimation of the probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red). Both signals have been centered and normalized such that their means \mu = 0 and their standard deviations \sigma = 1.


deviations equal 1. The probability density functions are estimated by computing the histograms of the signals and normalizing them by the number of samples in the signals. The LP residual of the clean signal (blue) has a higher peakedness: its kurtosis equals 42. The effect of the room reduces this peakedness for the reverberant signal (red), whose LP residual has a kurtosis of 10. By maximizing the kurtosis of the LP residuals we can therefore expect to improve the quality of the observed signal.

3.2.3 Maximization of the kurtosis

In the time domain

In order to enhance the reverberant signal x(n), an adaptive filter can be built which maximizes the kurtosis of its LP residual \tilde{x}(n). Given an L-tap adaptive filter h(n) at time n, the output of this filter is \tilde{y}(n) = h^T(n) \tilde{x}(n), where \tilde{x}(n) = [\tilde{x}(n-L+1), \ldots, \tilde{x}(n-1), \tilde{x}(n)]^T. An LP synthesis filter yields y(n), the final processed signal. The adaptation of h(n) is similar to a traditional Least-Mean-Square (LMS) adaptive filter [17], except that the optimized value is a feedback function f(n), corresponding to the gradient of the kurtosis.
Figure 3.12 (a) shows a diagram of the maximization system. The problem with this algorithm is the LP reconstruction artifacts. However, this system is linear and the order of the filters can be arbitrarily changed: h(n) can thus be computed from \tilde{x}(n) but applied directly to x(n) (see figure 3.12 (b)).
A gradient method can be used to optimize the kurtosis. The gradient of the kurtosis is computed by

\frac{\partial \tilde{k}}{\partial h} = \frac{4 \left( E\{\tilde{y}^2\} E\{\tilde{y}^3 \tilde{x}\} - E\{\tilde{y}^4\} E\{\tilde{y} \tilde{x}\} \right)}{E^3\{\tilde{y}^2\}}.    (3.31)

This gradient can be approximated by

\frac{\partial \tilde{k}}{\partial h} \approx \frac{4 \left( E\{\tilde{y}^2\} \tilde{y}^2 - E\{\tilde{y}^4\} \right) \tilde{y}}{E^3\{\tilde{y}^2\}} \, \tilde{x} = f(n) \, \tilde{x}(n)    (3.32)

where f(n) is the feedback function used to control the filter updates. For continuous adaptation, the expected values E\{\tilde{y}^2\} and E\{\tilde{y}^4\} are estimated recursively by

E\{\tilde{y}^2\}(n) = \beta \, E\{\tilde{y}^2\}(n-1) + (1 - \beta) \, \tilde{y}^2(n)    (3.33)

E\{\tilde{y}^4\}(n) = \beta \, E\{\tilde{y}^4\}(n-1) + (1 - \beta) \, \tilde{y}^4(n)    (3.34)
Figure 3.12: (a) A single channel time-domain adaptive algorithm for maximizing the
kurtosis of the LP residuals. (b) Equivalent system, which avoids LP reconstruction
artifacts.

where \beta < 1 controls the smoothness of the estimates.
The update equation of the filter is given by

h(n+1) = h(n) + \mu \, f(n) \, \tilde{x}(n)    (3.35)

where \mu controls the speed of adaptation.
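A simplified single-channel sketch of this adaptation loop is given below (Python/NumPy; the sparse toy "residual", the input normalization, the per-step unit-norm renormalization of h and the step sizes are assumptions made for illustration, not Gillespie's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "LP residual": a sparse pulse train (glottal-pulse-like) smeared by a
# short synthetic reverberation filter.
e_clean = np.zeros(20_000)
e_clean[rng.integers(0, 20_000, size=250)] = rng.choice([-1.0, 1.0], size=250)
room = np.r_[1.0, 0.6 * rng.normal(size=15)]
x = np.convolve(e_clean, room)[:len(e_clean)]
x /= x.std()                                 # unit variance keeps E{y^2} near 1

def kurt(v):
    return np.mean(v ** 4) / np.mean(v ** 2) ** 2

L, mu, beta = 16, 1e-4, 0.99
h = np.zeros(L); h[0] = 1.0                  # adaptive filter, starts as identity
Ey2, Ey4 = 1.0, 3.0                          # recursive moment estimates

k_before = kurt(x)
for n in range(L - 1, len(x)):
    xv = x[n - L + 1:n + 1][::-1]            # [x(n), x(n-1), ..., x(n-L+1)]
    y = h @ xv
    Ey2 = beta * Ey2 + (1 - beta) * y ** 2   # eq. (3.33)
    Ey4 = beta * Ey4 + (1 - beta) * y ** 4   # eq. (3.34)
    f = 4 * (Ey2 * y ** 2 - Ey4) * y / (Ey2 ** 3 + 1e-12)  # eq. (3.32)
    h = h + mu * f * xv                      # eq. (3.35), gradient ascent
    h /= np.linalg.norm(h)                   # kurtosis is scale-invariant
k_after = kurt(np.convolve(x, h)[:len(x)])
```

Because the kurtosis is invariant to the scale of h, renormalizing the filter after each update does not change the objective; it only keeps the recursion numerically bounded, which already hints at the need to constrain this maximization.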

In the frequency domain

However, according to Haykin [17], the convergence of an LMS-like algorithm in the time domain is very slow. Therefore, Gillespie [14] proposes to adapt the algorithm in the frequency domain. Moreover, by using more microphones and calculating the feedback function on an averaged output of all the channels, the accuracy and the speed of the adaptation are increased.
The frequency-domain method proposed in [14] uses a modulated complex lapped transform (MCLT) [18]. This filter-bank structure is close to a Discrete Cosine Transform (DCT). The general diagram of the method in the frequency domain for two microphones is shown in figure 3.13.

Figure 3.13: Two-channel frequency-domain adaptive algorithm for maximization of the kurtosis of the LP residual.

3.3 Discussion of the method

The maximization of the kurtosis permits real-time dereverberation. The adaptation is quick if a short adaptive filter is used. However, in the case of strong reverberation the improvement of the signal is not perceptible.
If the length of the adaptive filter is increased, the kurtosis is still maximized and the algorithm converges to a signal with maximum kurtosis. But the resulting signal sometimes has a higher kurtosis than the original clean signal. The sound is strongly distorted and sometimes no longer intelligible. Figure 3.14 shows the original LP residual of the clean signal. This signal is artificially reverberated and then enhanced by maximizing the kurtosis of the LP residuals. The resulting LP residual has a higher kurtosis than the original one. This means that the maximization has to be constrained: the clean speech has a higher kurtosis than the reverberated one, but this does not mean that the signal with the highest kurtosis is the clean signal.
Actually the length of the adaptive filter must not be longer than the period of the glottal pulses. With this constraint, the efficiency of the dereverberation is limited.
Another drawback of this method is that the LP analysis, as explained in section 2.1.4, is a very good approximation of the magnitude spectrum of the speech signal but strongly alters the phase spectrum. As the phase is crucial for source localization, it should be studied whether this method alters the phase information of the signal dramatically.

Figure 3.14: On the left the LP residual of a clean signal. On the right the LP residual
of the resulting dereverberated signal. The kurtosis of the dereverberated signal is
higher than the kurtosis of the original signal. The resulting signal is strongly distorted.

Chapter 4
Equalization of room impulse responses
In chapter 3 the dereverberation approach considered the effect of the room as
a distortion which alters the harmonicity of the speech signal. This chapter will
discuss methods to estimate room impulse responses. These estimated impulse
responses can then be equalized (inverted) in order to recover the original clean
speech signal (see section 2.3).
In section 4.1 the principle of a channel estimation method using the second
order statistics of the observed signals will be explained. Then, in sections 4.2
and 4.3, two different implementations of this principle will be discussed. Finally, in section 4.4, some ideas for improvement will be proposed.

4.1 Principle of the channel estimation

Some methods have been proposed to estimate a single channel. For example, Hopgood proposes in [19] a single-channel estimation method based on the non-stationarity of speech and the stationarity of the room impulse response. However, in most cases, these methods require that the input signal is white noise, which is not the case for a speech signal. In contrast, the estimation of several room impulse responses simultaneously is possible [20]. Moreover, as explained in section 2.3, it is much easier to find a global inverse for two or more room impulse responses than the inverse of a single one. In this section a method will be presented which permits the estimation of the impulse responses of a Single-Input Multiple-Output (SIMO) system using only second order statistics (SOS).

4.1.1 Hypothesis

In [20] Tong et al. show that a Single-Input Multiple-Output (SIMO) system can be identified under the following conditions:
1. The autocorrelation matrix of the source signal is of full rank.
2. The channel transfer functions do not share any common zeros.

4.1.2 Basic idea

The relation between the input and the outputs of a SIMO system (see figure 4.1) is:

x_i(n) = h_i(n) * s(n), \quad i \in [1, M]    (4.1)

Figure 4.1: SIMO system.

In vector/matrix form, this signal model becomes:

x_i(n) = H_i^T s(n)    (4.2)

where

x_i(n) = \left[ x_i(n), x_i(n-1), \ldots, x_i(n-L+1) \right]^T,    (4.3)

H_i = \begin{pmatrix} h_i(0) & \cdots & h_i(L-1) & & 0 \\ & \ddots & & \ddots & \\ 0 & & h_i(0) & \cdots & h_i(L-1) \end{pmatrix},    (4.4)

s(n) = \left[ s(n), s(n-1), \ldots, s(n-2L+2) \right]^T,    (4.5)

where L is the maximum length of the room impulse responses.


The idea of blind SIMO identification is to study the matrix

R_x = \begin{pmatrix}
\sum_{i \neq 1} R_{x_i x_i} & -R_{x_2 x_1} & \cdots & -R_{x_M x_1} \\
-R_{x_1 x_2} & \sum_{i \neq 2} R_{x_i x_i} & \cdots & -R_{x_M x_2} \\
\vdots & \vdots & \ddots & \vdots \\
-R_{x_1 x_M} & -R_{x_2 x_M} & \cdots & \sum_{i \neq M} R_{x_i x_i}
\end{pmatrix}    (4.6)

where R_{x_i x_j} = E\{x_i(n) x_j^T(n)\} are the auto- and cross-correlation matrices of the observed signals. The matrices R_{x_i x_j} can be written as
R_{x_i x_j} = \frac{1}{T} X_i X_j^T    (4.7)

where X_i is the L \times (T + L - 1) Sylvester matrix of x_i(n) and T is the number of samples of x_i(n).
If the matrix R_x is multiplied by the vector h = \left[ h_1^T \; h_2^T \; \cdots \; h_M^T \right]^T, we obtain (for the first L rows):

\sum_{i \neq 1} R_{x_i x_i} h_1 - R_{x_2 x_1} h_2 - \cdots - R_{x_M x_1} h_M
= \frac{1}{T} \sum_{i \neq 1} X_i X_i^T h_1 - \frac{1}{T} X_2 X_1^T h_2 - \cdots - \frac{1}{T} X_M X_1^T h_M
= \frac{1}{T} \sum_{i \neq 1} X_i \left( X_i^T h_1 - X_1^T h_i \right).

A left multiplication by the transpose of a Sylvester matrix is a convolution, so the term X_i^T h_1 - X_1^T h_i actually equals

x_i(n) * h_1(n) - x_1(n) * h_i(n) = s(n) * h_i(n) * h_1(n) - s(n) * h_1(n) * h_i(n)    (4.8)

and, as the convolution of real signals is commutative, this term equals zero. The same development can be performed for the other rows of the matrix product R_x h, which gives:

R_x h = 0,    (4.9)

which means that the vector h lies in the null space of the matrix R_x.
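The null-space property R_x h = 0 can be checked numerically for M = 2 channels (a Python/NumPy sketch; the channel lengths, the white source and the sample count are arbitrary choices made for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
L, T = 8, 5000
s = rng.normal(size=T)                 # source with full-rank autocorrelation
h1, h2 = rng.normal(size=L), rng.normal(size=L)
x1 = np.convolve(s, h1)[:T]
x2 = np.convolve(s, h2)[:T]

def windows(x):
    """Columns x_i(n) = [x(n), x(n-1), ..., x(n-L+1)]^T for n = L-1 ... T-1."""
    return np.array([x[n - L + 1:n + 1][::-1] for n in range(L - 1, T)]).T

X1, X2 = windows(x1), windows(x2)
R11, R22 = X1 @ X1.T / T, X2 @ X2.T / T
R21, R12 = X2 @ X1.T / T, X1 @ X2.T / T
# R_x of eq. (4.6) for M = 2 channels
Rx = np.block([[R22, -R21],
               [-R12, R11]])
h = np.concatenate([h1, h2])
residual = np.linalg.norm(Rx @ h) / np.linalg.norm(h)
```

Since the cross relation x_1 * h_2 = x_2 * h_1 holds exactly for the observed windows, the residual is zero up to floating-point rounding.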

4.1.3 How can this idea be implemented?

There are two distinct approaches to identify the SIMO system:

1. An eigenvalue decomposition is performed on the matrix R_x and its null space is computed [21]. Under the hypothesis that R_s = E\{s(n) s^T(n)\} is full rank, this null space corresponds to the unknown system h. This batch method is discussed in section 4.2.

2. A set of filters g_i(n) is adaptively estimated such that h_i(n) * g_j(n) - h_j(n) * g_i(n) = 0 for all pairs (i, j) [22]. This iterative method is discussed in section 4.3.

4.1.4 Why do the channels have to be coprime?

The second hypothesis, which requires that the channel transfer functions do not share any common zeros, can be explained as follows.
Consider for example two channels with impulse responses h_1(n) and h_2(n). If the transfer functions of these channels share common zeros, then the impulse responses can be rewritten as

h_1(n) = d(n) * \tilde{h}_1(n)    (4.10)

h_2(n) = d(n) * \tilde{h}_2(n)    (4.11)

where d(n) is, by analogy with polynomials, the greatest common divisor of h_1(n) and h_2(n), and the transfer functions of \tilde{h}_1(n) and \tilde{h}_2(n) are coprime (do not share any common zeros). Then x_1(n) and x_2(n) become

x_1(n) = s(n) * d(n) * \tilde{h}_1(n),    (4.12)

x_2(n) = s(n) * d(n) * \tilde{h}_2(n),    (4.13)

and, if the correlation matrix of s(n) * d(n) is full rank, the methods will identify the system [\tilde{h}_1(n) \; \tilde{h}_2(n)] instead of [h_1(n) \; h_2(n)].
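This ambiguity is easy to reproduce numerically (Python/NumPy sketch; the lengths of d(n) and of the coprime parts are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d = rng.normal(size=4)                               # common factor d(n)
h1t, h2t = rng.normal(size=5), rng.normal(size=5)    # coprime parts h~1, h~2
h1, h2 = np.convolve(d, h1t), np.convolve(d, h2t)    # eqs. (4.10)-(4.11)

s = rng.normal(size=2000)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)

# The cross relation is satisfied by the true channels ...
r_true = np.linalg.norm(np.convolve(x1, h2) - np.convolve(x2, h1))
# ... but equally well by the reduced system [h~1, h~2]:
r_reduced = np.linalg.norm(np.convolve(x1, h2t) - np.convolve(x2, h1t))
```

Both residuals vanish, so nothing in the second order statistics distinguishes the true channels from the reduced system; the common factor d(n) is unidentifiable.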

4.1.5 Estimation of the length of the filters

Both the batch and the iterative implementation require that the lengths of the channels are given. The estimation of these lengths is very important, as will be explained in this subsection.
In the two-microphone case, the channel estimation tries to find two FIR filters g_1(n) and g_2(n) of length L_g + 1 such that

h_1(n) * g_2(n) - h_2(n) * g_1(n) = 0,    (4.14)

where h1 (n) and h2 (n) are the two unknown FIR filters we want to identify. The
lengths of these filters are equal to Lh + 1, which is also unknown.
In the z-domain, the relation between the filters can be written as an operation on polynomials:

H_1(z) G_2(z) = H_2(z) G_1(z).    (4.15)

The polynomials H_1(z) G_2(z) and H_2(z) G_1(z) are equal if and only if they have exactly the same L_h + L_g zeros.
As H_1(z) and H_2(z) do not share common zeros, each zero of H_1(z), resp. H_2(z), must also be a zero of G_1(z), resp. G_2(z). G_1(z) and G_2(z) therefore contain at least L_h zeros, and thus L_g \geq L_h.
When L_g = L_h, the method returns the estimated channels directly. However, when the lengths of the filters (the channel order) are over-estimated, additional zeros appear. Figure 4.2 illustrates the system in the two-microphone case.
Figure 4.2: Channel identification with overestimated channel orders.

If we look at the estimated zeros of the channels in the z-plane, we can observe these additional zeros on each channel (see figure 4.3, left). However, these additional zeros are common to all the estimated channels (see figure 4.3, right). The channel identification thus works if the channel orders are over-estimated, and the additional zeros in the transfer functions cause a distortion which can be removed. On the contrary, if the channel orders are under-estimated, the method will not manage to estimate the channels properly: the relation between the two channels (equation (4.14)) cannot be satisfied. The positions of the zeros of the estimated channels then differ largely from the positions of the zeros of the real channels. The consequences for the estimated impulse responses, and especially for their inverses, are disastrous.

Figure 4.3: Estimated zeros and real zeros for one channel (left); zeros of all the estimated channels (right). On the left, 4 estimated zeros stand alone: they do not correspond to a real zero of the filter. On the right it can be noticed that these 4 additional zeros are common to all the estimated channels.

4.2 Batch method

The batch method for SIMO system identification is well described in [21]. The principle is to compute the eigenvalues of the cross-correlation-like matrix R_x. In the noiseless case there are \hat{L} - L + 1 eigenvalues equal to zero, where L is the real order of the channels and \hat{L} the estimated one.
In figure 4.4 a two-channel system is simulated. The speech source signal is sampled at 16 kHz and the order of the two impulse responses is 100 taps. These impulse responses are artificially generated by multiplying a white Gaussian noise with an exponential decay. The correlation matrix is computed for an estimated filter order of 110; there are 11 eigenvalues equal to 0.

4.2.1 Extraction of the common part

As explained in the previous section, the real impulse response of the room is the common part of the eigenvectors which span the null space. A method to compute this greatest common divisor is to build the matrix

K = \left[ g_1 \; g_2 \; \ldots \; g_{\hat{L}-L+1} \right],    (4.16)

where the g_i are the eigenvectors of R_x corresponding to the eigenvalue 0. By applying a QR decomposition to the transpose of the matrix K, the last row of the R matrix is the desired common part up to a scaling factor. A proof of this


Figure 4.4: Eigenvalues of the matrix Rx in the noiseless case. On the right: zoom on
the smallest eigenvalues.

theorem can be found in [21]. By performing row rotations on the null space matrix, several estimates of this common part can be obtained and averaged.
The QR decomposition of a matrix A is a factorization expressing A as

A = QR    (4.17)

where Q is an orthogonal matrix (Q Q^T = I) and R is an upper triangular matrix. The matrices Q and R can be computed using the Gram-Schmidt method. The Gram-Schmidt process of linear algebra is a method for orthogonalizing a set of vectors in an inner product space. Orthogonalization in this context means the following: we start with linearly independent vectors v_1, \ldots, v_k (the column vectors of A) and we want to find mutually orthogonal vectors u_1, \ldots, u_k which generate the same subspace as the vectors v_1, \ldots, v_k. The matrix Q contains the orthogonal basis vectors and the matrix R the coordinates of the vectors v_1, \ldots, v_k in this new basis.
Figure 4.5 shows, on the left, 4 of the 11 eigenvectors corresponding to the eigenvalue 0 for the system presented in figure 4.4. On the right we can see that the normalized common part of the null space and the normalized original impulse responses are equal. The estimated impulse responses begin with 10 zeros corresponding to the channel order over-estimation.
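The core of the batch method can be sketched for two channels as follows (Python/NumPy; for simplicity the channel order is assumed to be known exactly, so the null space is one-dimensional and the QR-based extraction of the common part is not needed — all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
L, T = 8, 8000
s = rng.normal(size=T)
h1, h2 = rng.normal(size=L), rng.normal(size=L)
x1, x2 = np.convolve(s, h1)[:T], np.convolve(s, h2)[:T]

def corr(xa, xb):
    """R_{x_a x_b} = E{x_a(n) x_b^T(n)} estimated over the observation."""
    Xa = np.array([xa[n - L + 1:n + 1][::-1] for n in range(L - 1, T)])
    Xb = np.array([xb[n - L + 1:n + 1][::-1] for n in range(L - 1, T)])
    return Xa.T @ Xb / T

# R_x of eq. (4.6) for M = 2; it is symmetric positive semidefinite.
Rx = np.block([[corr(x2, x2), -corr(x2, x1)],
               [-corr(x1, x2), corr(x1, x1)]])
w, V = np.linalg.eigh(Rx)               # eigenvalues in ascending order
h_est = V[:, 0]                         # eigenvector of the ~zero eigenvalue
h_true = np.concatenate([h1, h2])
h_est = h_est * (h_est @ h_true)        # resolve the scale/sign ambiguity
err = np.linalg.norm(h_est - h_true) / np.linalg.norm(h_true)
```

The smallest eigenvalue is numerically zero and its eigenvector recovers both channel impulse responses up to a common scale factor, which is the inherent ambiguity of blind identification.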

Figure 4.5: Left: 4 of the 11 eigenvectors of the null space. Right: common part of the null space (blue) and real impulse response (red). The impulse responses of the 2 channels are concatenated and 10 zeros (corresponding to the over-estimation of the order) were added.

4.2.2 Noisy case

In the presence of additive uncorrelated white noise,

y_i(n) = h_i(n) * s(n) + b(n) = x_i(n) + b(n),    (4.18)

the matrix R_y has no eigenvalues equal to zero. However, this matrix can be approximated by

R_y \approx R_x + R_b,    (4.19)

and as R_b \approx \sigma^2 I, where I is the identity matrix and \sigma^2 the variance of the noise, the \hat{L} - L + 1 smallest eigenvalues will be \sigma^2 instead of zero. The corresponding eigenvectors remain intact.
Figure 4.6 shows the effect of additive noise on the eigenvalues for the example system. With light noise the eigenvalues equal to \sigma^2 can clearly be identified. However, as the autocorrelation matrix of the speech signal has very small eigenvalues (which are not equal to 0), the smallest eigenvalues become hard to identify when the signal-to-noise ratio decreases.

4.3 Iterative method

The batch channel estimation method requires computing the eigenvalues of a very big matrix. Even though this matrix is symmetric, the required memory and

Figure 4.6: Eigenvalues of the correlation matrix in the noisy case. The variance of the noise is equal to 10^{-10} on the left and 10^{-6} on the right.

the computational load are prohibitive. Huang and Benesty propose in [22] an iterative method which solves the problem using adaptive filtering instead of an eigenvalue computation.
The iterative method directly uses the cross relations among the observed signals, exploiting the fact that

x_i * h_j = s * h_i * h_j = x_j * h_i, \quad i, j = 1, 2, \ldots, M.    (4.20)

In the absence of noise, this gives in vector notation at time n:

x_i^T(n) h_j = x_j^T(n) h_i.    (4.21)

In the presence of noise, or if the channel impulse responses are not perfectly estimated, an error signal is produced at time n + 1:

e_{ij}(n+1) = x_i^T(n+1) \hat{h}_j(n) - x_j^T(n+1) \hat{h}_i(n)    (4.22)

where the \hat{h}_i(n) are the channel responses estimated at time n.
The principle of the adaptive implementation is to minimize this error. In order to avoid trivial solutions, a unit-norm constraint is imposed on

\hat{h}(n) = \left[ \hat{h}_1^T(n) \; \hat{h}_2^T(n) \; \ldots \; \hat{h}_M^T(n) \right]^T    (4.23)

leading to a normalized error signal

\epsilon_{ij}(n+1) = \frac{e_{ij}(n+1)}{\| \hat{h}(n) \|}.    (4.24)


The cost function to be minimized is then defined by

J(n+1) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \epsilon_{ij}^2(n+1),    (4.25)

and the update equation of the normalized multichannel LMS algorithm (NMCLMS) [10] is

\hat{h}(n+1) = \hat{h}(n) - \mu \, \nabla J(n+1)    (4.26)

where \mu is a small positive update step and \nabla J(n+1) the gradient of the cost function.
Contrary to the batch method, which gives several estimates of the channels in case of over-estimation of the channel orders, the iterative method does not offer an easy post-processing method to remove the additional zeros of the estimated filters.
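A two-microphone sketch of this adaptive scheme is given below (Python/NumPy; the exact NMCLMS gradient of [22] is replaced here by a simpler NLMS-style stochastic step on the squared cross-relation error of eq. (4.22), with the unit-norm constraint of eq. (4.23) enforced by renormalization — an illustrative simplification, not Huang and Benesty's implementation):

```python
import numpy as np

rng = np.random.default_rng(8)
L, T = 8, 30_000
s = rng.normal(size=T)
h1, h2 = rng.normal(size=L), rng.normal(size=L)
x1, x2 = np.convolve(s, h1)[:T], np.convolve(s, h2)[:T]

hh = rng.normal(size=2 * L)             # stacked estimate [h^_1; h^_2]
hh /= np.linalg.norm(hh)                # unit-norm initial guess

def cost(hh):
    """Mean squared cross-relation error for a unit-norm channel estimate."""
    e = np.convolve(x1, hh[L:]) - np.convolve(x2, hh[:L])
    return np.mean(e ** 2)

c_init = cost(hh)
mu = 0.5
for n in range(L - 1, T):
    v1 = x1[n - L + 1:n + 1][::-1]
    v2 = x2[n - L + 1:n + 1][::-1]
    e = v1 @ hh[L:] - v2 @ hh[:L]       # e_12(n), eq. (4.22)
    norm2 = v1 @ v1 + v2 @ v2 + 1e-9    # NLMS-style step normalization
    # gradient of e^2 wrt [h^_1; h^_2] is [-2 e v2; 2 e v1]
    hh[:L] += mu * e * v2 / norm2
    hh[L:] -= mu * e * v1 / norm2
    hh /= np.linalg.norm(hh)            # unit-norm constraint, eq. (4.23)
c_final = cost(hh)
```

One pass over the data drives the cross-relation cost down by orders of magnitude; the estimate converges toward the null-space direction found by the batch method, here without any eigendecomposition.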

4.3.1 Choice of the optimization method

Huang and Benesty [22] made two propositions to improve the optimization. Firstly, the efficiency of the computation is improved by using a frequency-domain approach: the convolutions are computed with a Fast Fourier Transform (FFT) and an overlap-save method.
Secondly, a Newton method instead of a gradient descent is used to speed up the convergence of the optimization, but then the Hessian matrix of the cost function has to be computed. To lower the computational burden, this matrix can be approximated by a diagonal matrix and computed recursively, which reduces the computational load of the algorithm.

4.3.2 Simulation

This method is faster than the batch implementation and requires less memory. If the order of the channels is small, the adaptive process can be computed in real time. However, if only two microphones are used, the convergence takes too long, as speech signals are generally not well conditioned.
Figure 4.7 (left) shows the iteratively estimated zeros of one of the channels of the example system presented in figure 4.4, without over-estimation of the channel orders. It can be noticed that most of the zeros are close to


their real positions but some are very badly estimated. The estimated impulse responses can be inverted using the MINT method (section 2.3). The sum of the convolutions h_{i,original}(n) * g_i(n) can then be computed, where the h_{i,original}(n) are the real impulse responses of the system (which are known in this simulation) and the g_i are the inverses obtained from the estimated impulse responses. The result is shown in figure 4.7 (right): it should be a unit impulse (drawn in red), but, as can be seen, there are large deviations from the unit impulse due to the badly estimated zeros.

Figure 4.7: Iterative estimation of the channel impulse responses using two microphones. On the left the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac impulse (red).

A solution to improve the estimation of the channels is to increase the number of microphones. Figure 4.8 shows the same simulation as before with 5 microphones instead of 2. The estimation improves strongly and, after the inversion, the remaining reverberation in the signal can be neglected.

4.4 Improvement of the method

The main issue of this channel estimation method is that the channel order has to be overestimated; when it is underestimated, the real impulse responses cannot be found. Moreover, the impulse response of a room is generally very long (more than 10000 samples for a sampling frequency of 16 kHz). In real environments, the required memory and the computational load are so high that it is impossible to estimate the channels.

Figure 4.8: Iterative estimation of the channel impulse responses using 5 microphones. On the left the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac impulse (red).

An idea to reduce the length of the room impulse responses is to perform a subband processing using a filter-bank and to downsample the signal in each subband.
Given f_j(n), the analysis filter of subband j, the signal of channel i in subband j is

x_{i,j}(n) = x_i(n) * f_j(n)
           = s(n) * h_i(n) * f_j(n)
           = (s(n) * f_j(n)) * h_i(n)    (4.27)
           = s_j(n) * h_i(n).

In subband j, the channel identification problem can be formulated in the same way as before by replacing s(n) with s_j(n) = s(n) * f_j(n).
However, as the filter f_j(n) is a band-pass filter, the new source signal s_j(n) has a limited bandwidth. Its autocorrelation matrix is then not full rank: the first estimation hypothesis is not fulfilled anymore.
This means that the downsampling of the signal has to be performed not only in order to reduce the computational load but is also required in order to fulfill the estimation hypothesis.
However, after the subsampling, the observed subband signals do not seem to be correlated anymore and the channel identification does not work. In order to find an explanation, we perform a small simulation. Two signals are randomly generated and filtered with a simple FIR low-pass filter so that they can be subsampled without aliasing. In the first experiment the two signals are first convolved and then subsampled. In the second experiment the same signals are subsampled and then the convolution of the two resulting signals is computed. Figure 4.9 shows the positions of the zeros of the two obtained signals.

Figure 4.9: Comparison of the position of the zeros when the convolution and the
subsampling are performed in a different order.

We can notice that near the unit circle the zeros of the two signals coincide. This was expected, as the Fourier transforms of the two signals are the same (up to a constant factor). However, the positions of the zeros far from the unit circle differ greatly.
As the channel identification method uses the transfer functions of the filters and not only their frequency responses, the cross-relations between the channels are impaired by the subsampling and the channel identification does not work.
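The failure of the cross relation after subsampling can be reproduced directly (Python/NumPy sketch; decimating the channel impulse responses is only one naive candidate for a "decimated channel model", chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
L = 8
s = rng.normal(size=4000)
h1, h2 = rng.normal(size=L), rng.normal(size=L)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)

# At the full rate the cross relation x1*h2 = x2*h1 holds exactly:
full_err = np.linalg.norm(np.convolve(x1, h2) - np.convolve(x2, h1))

# After decimation by 2, no such relation holds for the decimated filters:
# the odd-phase samples of the source leak into both sides.
d1, d2 = x1[::2], x2[::2]
g1, g2 = h1[::2], h2[::2]        # naive "decimated channel" candidates
dec = np.convolve(d1, g2) - np.convolve(d2, g1)
dec_rel = np.linalg.norm(dec) / np.linalg.norm(np.convolve(d1, g2))
```

In polyphase terms, each decimated observation mixes the even and odd phases of the source, so the decimated system is no longer a single-channel convolution and the identity that the identification relies on breaks down.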

4.5 Discussion of the channel estimation methods

Channel estimation is theoretically very attractive. Using two microphones, which is not a problem on ASIMO, it is possible to estimate the room impulse responses, even in the presence of additive noise. These room impulse responses can then be globally inverted with simple FIR filters, yielding an exact dereverberation operator.


This method uses second order statistics, which assures a robust convergence of the estimation to the real room impulse responses. Even though the batch method can manage the overestimation of the channel order, the iterative implementation should be chosen for computational and memory reasons.
All the recent publications on this subject try either to speed up the convergence of the estimation [24] or to cope with the overestimation problem of the iterative implementation [25].
However, even if the algorithms converge optimally and the common part of the
estimated channels can be removed, it is illusory to think that such a method
can work in real environments. From a purely computational point of view the
adaptation can be performed in real-time. But even if a supercomputer were
available, the probability that the channels share common zeros logically increases
with the number of zeros and, moreover, the smallest eigenvalues of the
autocorrelation matrix of the original speech signal become smaller and smaller as
the size of the matrix increases. Therefore the necessary conditions for the channel
identification certainly no longer hold when the channel order increases.
Moreover, even though a room impulse response has a finite duration, it is difficult
to decide at which point the vanishingly small coefficients can be neglected. In case
of underestimation, even if the truncated coefficients of the filters are small, the
effect on the channel estimation is important, and after the inversion of the system
strong reverberation will still be present. Sometimes this remaining reverberation
is worse than the reverberation on the observed signal.
Finally, the subband processing, from which we expected a reduction of the channel
orders, did not yield a usable channel estimate. The method can therefore only
work with very small channel orders (about 300 samples) and is unusable in
real environments (channel orders of about 20000 samples).

Chapter 5
Conclusion and outlook
5.1 Review of the studied methods

This diploma thesis selected the most promising blind dereverberation methods
existing today and investigated them. We now review the conclusions we drew
about these methods after implementing them.

5.1.1 Harmonicity-based dereverberation

This method seems very promising, especially concerning an implementation on
ASIMO. However, the amount of training data required to correctly estimate the
dereverberation operator is prohibitive. Moreover, even if the training sequence
does not have to be known, the conditions on it are such that this method can not
really be considered a blind dereverberation method.
In a real environment, where the sound source and the microphone are moving,
the dereverberation operator can not be estimated. However, the idea that
harmonic filtering enhances a harmonic signal impaired by reverberation can
be kept to improve the voiced parts of the speech signal.
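The surviving idea can be sketched with a small pitch-synchronous comb filter (sample rate, pitch, number of averaged periods and the noise standing in for the reverberant tail are all assumed toy values, not the HERB operator): averaging the signal over several pitch periods reinforces the harmonic part while attenuating the incoherent part.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, f0 = 16000, 200            # assumed sample rate and pitch (Hz)
T = fs // f0                   # pitch period in samples
n = np.arange(20 * T)

# Harmonic (voiced) part plus noise standing in for the diffuse tail.
harm = sum(np.sin(2 * np.pi * k * f0 / fs * n) for k in range(1, 4))
x = harm + rng.standard_normal(n.size)

# Pitch-synchronous comb filter: average P period-shifted copies.
# The T-periodic harmonics add coherently, the noise does not.
# np.roll is valid here because the signal length is a multiple of T.
P = 8
y = np.mean([np.roll(x, p * T) for p in range(P)], axis=0)

def snr_db(sig, ref):
    return 10 * np.log10(np.sum(ref**2) / np.sum((sig - ref)**2))

print(snr_db(x, harm), snr_db(y, harm))  # SNR improves after filtering
```

Averaging P independent disturbance samples reduces their power by roughly a factor P, which is the mechanism that makes the voiced parts benefit from harmonic filtering.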

5.1.2 Linear prediction analysis

Linear prediction analysis permits working with a practical parametric model
of speech. The dereverberation process only has to act on the LP residual of the
signal, which greatly simplifies the task.


However, the enhancement of the LP residual has to use higher order statistics
(HOS) of the signal, e.g., in the presented method, the kurtosis. The use of
HOS causes convergence problems, and the maximization has to be constrained
in order not to deteriorate the speech signal.
Moreover, the simultaneous use of LP analysis and of the Gammatone filterbank
is not an easy task, and the use of such a method on ASIMO implies a parallel
audio processing architecture.
Finally, an all-pole model does not keep the phase information of the speech.
After LP analysis, the temporal information of the signal is contained in the LP
residual. As the effect of the kurtosis maximization on the phase is difficult to
control, we are not sure that a source localization can still be successfully performed
after the dereverberation.
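The statistic driving the maximization can be illustrated with a small sketch (pulse spacing, filter length and decay rate are invented toy values): the LP residual of clean voiced speech is peaky and has a high kurtosis, while convolution with a long room-like filter makes it nearly Gaussian, so maximizing the kurtosis of the processed residual pushes it back towards the clean one.

```python
import numpy as np

rng = np.random.default_rng(2)

def kurtosis(x):
    # Normalized fourth-order moment; equals 3 for a Gaussian signal.
    x = x - np.mean(x)
    return np.mean(x**4) / np.mean(x**2) ** 2

# Peaky pulse train standing in for the LP residual of clean voiced speech.
residual = np.zeros(4000)
residual[::80] = 1.0                      # one glottal pulse per pitch period
residual += 0.01 * rng.standard_normal(residual.size)

# Long decaying random filter standing in for a room impulse response.
h = rng.standard_normal(1000) * np.exp(-np.arange(1000) / 200.0)
reverberant = np.convolve(residual, h)[: residual.size]

print(kurtosis(residual), kurtosis(reverberant))  # high vs. close to 3
```

The gap between the two values is what the adaptive filter climbs during the maximization; the constraint problem mentioned above arises because the climb is unbounded.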

5.1.3 Channel estimation

A dereverberation method using an estimation of the channel is very attractive.
In theory such a method, using multiple microphones, can permit an exact
dereverberation.
However, real room impulse responses are so long that the estimation is impossible
in practical cases. The length of the impulse responses can not be underestimated,
and there are therefore so many parameters to optimize simultaneously that, even
if the computation can be performed using effective algorithms, convergence
is impossible to obtain.
Moreover, the dereverberation is very sensitive to estimation errors, i.e. small
errors in the estimation can produce a catastrophic result after the inversion of
the room effect.

5.1.4 Direct comparison of the methods

Table 5.1 recapitulates some general information about the implemented methods.
For the kurtosis maximization, the number of taps of the adaptive filter can be
increased; however, in this case the maximization is not robust.


                      HERB              Kurtosis          Channel estimation
                                        maximization      batch        iterative
Real-time             no                yes               for short channels
Max channel length    several seconds   500 taps          150 taps     300 taps
Exact dereverb.       no                no                yes          yes
Always improves       yes               yes               no           no
Robustness            yes               yes               no           no

Table 5.1: Comparison of the methods

5.2 Speech model based method vs. channel estimation

At first view, the methods based on the estimation and the inversion of the room
impulse response seem more practical to use. Indeed, as the effect of the room
is more stationary than speech, its parameters are easier to estimate. However,
such a method has no control over the quality of the output signal, as it does not
take into consideration that the input and the output are speech signals.
On the other hand, the methods based on a model of speech will always improve
the quality of the speech. However, the optimization criteria are subjective,
i.e. the aim is that the speech sounds better. Moreover, the phase spectrum, which
is less important for speech perception, is often neglected in such methods.

5.3 What should we decide for ASIMO?

If we had to choose one of the studied methods to implement on ASIMO, the
best one would probably be the maximization of the kurtosis, as it is the only
one which can work in real-time. However, the number of taps of the maximization
filter limits the efficiency of the dereverberation. A maximization filter with more
taps could be used, but in this case a method must be found to constrain the
maximization.
On the other hand, the channel estimation method could give some valuable
information about the shape of the room impulse responses, especially their first
coefficients. This means that we could look at the estimated channels but in no
case try to invert them.


In fact, the blind dereverberation problem seems not to be solvable at the signal
processing level. We should therefore rather try to develop speech recognition
algorithms which are robust against reverberation. In that sense, it seems advisable
to leave it to higher levels in the speech recognition pathway to cope with the
reverberation.

Appendix A
Proofs
On page 50 the following theorem has been used to derive the dereverberation
operator of the HERB method.

Let $f(z) = \frac{1}{1+z}$, where $z$ is a complex random variable. If it is assumed that

- the phase of $z$ is uniformly distributed within $[0, 2\pi)$,
- the phase of $z$ and its absolute value are statistically independent, and
- $|z| \neq 1$,

then the expected value of $f(z)$ is equal to the probability that $|z| < 1$:

$$E\{f(z)\} = P\{|z| < 1\}. \tag{A.1}$$

As we did not find a proof of this theorem in the literature, we propose a
possible one here.
The idea is to decompose $f(z)$ into an infinite series. This decomposition is
different depending on whether the absolute value of $z$ is smaller or greater than 1.
First case, $|z| < 1$:
The function can be decomposed into the infinite series

$$f(z) = \sum_{k=0}^{+\infty} (-z)^k, \tag{A.2}$$

which converges since $|z| < 1$.
The complex value $z$ can be written

$$z = r e^{j\theta}, \tag{A.3}$$

where $r$ and $\theta$ are respectively the absolute value and the phase of $z$. The
infinite series can then be rewritten as

$$f(z) = \sum_{k=0}^{+\infty} r^k e^{jk(\theta+\pi)}, \tag{A.4}$$

and, as the expectation operator is linear, the expected value of $f(z)$ becomes

$$E\{f(z)\} = \sum_{k=0}^{+\infty} E\left\{ r^k e^{jk(\theta+\pi)} \right\}. \tag{A.5}$$

Note: equation (A.5) holds if and only if the new series converges. This
convergence will be shown in the following.
As the absolute value and the phase are statistically independent, each term
of the series can be rewritten as

$$E\left\{ r^k e^{jk(\theta+\pi)} \right\} = E\left\{ r^k \right\} E\left\{ e^{jk\theta} \right\} e^{jk\pi}, \quad k \in \mathbb{N}. \tag{A.6}$$

As $\theta$ is uniformly distributed within $[0, 2\pi)$,

$$E\left\{ e^{jk\theta} \right\} = \begin{cases} 1 & \text{if } k = 0, \\ 0 & \text{else}, \end{cases} \tag{A.7}$$

and all the terms of the infinite series are equal to zero except for $k = 0$.
As the infinite series is thus reduced to one term, it converges and its value is 1
($E\{r^0\} = 1$).
Second case, $|z| > 1$:
In this case $f(z)$ can be rewritten as

$$f(z) = \frac{1}{1+z} = \frac{z^{-1}}{1+z^{-1}}. \tag{A.8}$$

As $|z^{-1}| < 1$, the function can be decomposed into the infinite series

$$f(z) = z^{-1} \sum_{k=0}^{+\infty} (-z^{-1})^k \tag{A.9}$$
$$= r^{-1} e^{-j\theta} \sum_{k=0}^{+\infty} r^{-k} e^{-jk\theta} e^{jk\pi} \tag{A.10}$$
$$= \sum_{k=0}^{+\infty} r^{-(k+1)} e^{-j(k+1)\theta} e^{jk\pi} \tag{A.11}$$
$$= \sum_{k'=1}^{+\infty} r^{-k'} e^{-jk'\theta} e^{j(k'-1)\pi}. \tag{A.12}$$

Then, as there is no term at $k' = 0$, the same reasoning as in the previous
case leads to $E\{f(z)\} = 0$.
It has been shown that

$$E\{f(z)\} = \begin{cases} 1 & \text{if } |z| < 1, \\ 0 & \text{if } |z| > 1. \end{cases} \tag{A.13}$$

As it is assumed that $|z| \neq 1$, conditioning on the two events $|z| < 1$ and
$|z| > 1$ gives, in the general case,

$$E\{f(z)\} = 1 \cdot P\{|z| < 1\} + 0 \cdot P\{|z| > 1\} = P\{|z| < 1\}. \tag{A.14}$$
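The result can also be checked numerically. In the following Monte Carlo sketch the radius law is chosen arbitrarily (two point masses away from $|z| = 1$, so the assumption $|z| \neq 1$ holds), while the phase is uniform and independent of the radius, as the theorem requires:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Phase uniform on [0, 2*pi); radius independent of it and never 1.
theta = rng.uniform(0.0, 2.0 * np.pi, N)
r = rng.choice([0.5, 2.0], size=N, p=[0.3, 0.7])
z = r * np.exp(1j * theta)

estimate = np.mean(1.0 / (1.0 + z))   # Monte Carlo estimate of E{f(z)}
target = np.mean(r < 1.0)             # empirical P{|z| < 1}

print(estimate, target)  # real part of the estimate matches the target
```

Keeping the radius away from 1 also keeps the integrand bounded, so the sample mean converges quickly; radius laws with mass near $|z| = 1$ would converge much more slowly.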

Bibliography
[1] Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic
Press, 2003.
[2] B. C. J. Moore and B. R. Glasberg. Suggested formulae for calculating
auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am.,
74:750-753, 1983.
[3] John R. Deller, Jr., John H. L. Hansen, and John G. Proakis. Discrete-Time
Processing of Speech Signals. IEEE Press, 2000.
[4] Tomohiro Nakatani and Masato Miyoshi. Blind dereverberation of single
channel speech signal based on harmonic structure. IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1:92-95,
2003.
[5] Jont Allen and David Berkley. Image method for efficiently simulating small
room acoustics. Journal of the Acoustical Society of America, pages 912-915,
1979.
[6] Stephen G. McGovern. A simple model for room acoustics
(http://www.steve-m.us/rir.html). Web page.
[7] Mingyang Wu and DeLiang Wang. A one-microphone algorithm for reverberant
speech enhancement. IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 1:844-847, 2003.
[8] Alan V. Oppenheim and Ronald W. Schafer. Digital Signal Processing.
Prentice-Hall, 1975.
[9] Masato Miyoshi and Yutaka Kaneda. Inverse filtering of room acoustics.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2):145-152,
February 1988.


[10] Jacob Benesty, Shoji Makino, and Jingdong Chen. Speech Enhancement.
Springer, 2005.
[11] Martin Heckmann, Frank Joublin, and Edgar Körner. Sound source separation
for a robot based on pitch. In IROS, 2005. Accepted.
[12] Tomohiro Nakatani, Keisuke Kinoshita, Masato Miyoshi, and Parham S.
Zolfaghari. Harmonicity based blind dereverberation with time warping.
Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual
Audio Processing (SAPA), October 2004.
[13] B. Yegnanarayana and P. Satyanarayana Murthy. Enhancement of reverberant
speech using LP residual signal. IEEE Transactions on Speech and Audio
Processing, 8(3):267, May 2000.
[14] Bradford W. Gillespie, Henrique S. Malvar, and Dinei A. F. Florencio.
Speech dereverberation via maximum-kurtosis subband adaptive filtering.
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2001.
[15] Marc Delcroix, Takafumi Hikichi, and Masato Miyoshi. Dereverberation of
speech signals based on linear prediction. INTERSPEECH - ICSLP, 2004.
[16] Nikolay D. Gaubitch, Patrick A. Naylor, and Darren B. Ward. On the use of
linear prediction for dereverberation of speech. In International Workshop
on Acoustic Echo and Noise Control (IWAENC), 2003.
[17] Simon Haykin. Adaptive Filter Theory. Prentice Hall, 2001.
[18] Henrique Malvar. A modulated complex lapped transform and its applications
to audio processing. Technical Report MSR-TR-99-27, Microsoft Research,
May 1999.
[19] James Robert Hopgood. Nonstationary Signal Processing with Application to
Reverberation Cancellation in Acoustic Environments. PhD thesis, Queens
College, University of Cambridge, 2000.
[20] L. Tong, G. Xu, and T. Kailath. A new approach to blind identification
and equalization of multipath channels. Proc. 25th Asilomar Conf. (Pacific
Grove, CA), pages 856-860, 1991.
[21] Sharon Gannot and Marc Moonen. Subspace methods for multi-microphone
speech dereverberation. Technical report, CCIT Report 398, Technion -
Israel Institute of Technology, Haifa, 2002.


[22] Yiteng Huang and Jacob Benesty. A class of frequency-domain adaptive
approaches to blind multichannel identification. IEEE Transactions on Signal
Processing, 51:11-24, January 2003.
[23] Zhu Liang Yu and Meng Hwa Er. Blind multichannel identification for
speech dereverberation and enhancement. IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 4:105-108, 2004.
[24] Zhu Liang Yu and Meng Hwa Er. A robust adaptive blind multichannel
identification algorithm for acoustic applications. IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 2:25-28, 2004.
[25] Takafumi Hikichi, Marc Delcroix, and Masato Miyoshi. Blind dereverberation
based on estimates of signal transmission channels without precise information
on channel order. IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 1:1069-1072, 2005.