
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007

Multicomponent AM-FM Representations:


An Asymptotically Exact Approach
Francesco Gianfelici, Giorgio Biagetti, Member, IEEE, Paolo Crippa, Member, IEEE, and
Claudio Turchetti, Member, IEEE

Abstract: This paper presents, on the basis of a rigorous mathematical formulation, a multicomponent sinusoidal model that allows an asymptotically exact reconstruction of nonstationary speech
signals, regardless of their duration and without any limitation in
the modeling of voiced, unvoiced, and transitional segments. The
proposed approach is based on the application of the Hilbert transform to obtain an amplitude signal from which an AM component
is extracted by filtering, so that the residue can then be iteratively
processed in the same way. This technique permits a multicomponent AM-FM model to be derived in which the number of components (iterations) may be arbitrarily chosen. Additionally, the
instantaneous frequencies of these components can be calculated
with a given accuracy by segmentation of the phase signals. The
validity of the proposed approach has been proven by some applications to both synthetic signals and natural speech. Several comparisons show how this approach almost always has a higher performance than that obtained by current best practices, and does
not need the complex filter optimizations required by other techniques.
Index Terms: AM-FM speech model, envelope estimation,
Gabor signal, Hilbert transform, multicomponent modeling,
sinusoidal model.

I. INTRODUCTION
SINUSOIDAL models, as defined by McAulay and Quatieri
in [1], are highly parametric representations of speech signals, based on physiologic properties of speech production and
perception. This characterization can be assimilated to the joint
action of both amplitude modulation (AM) and frequency modulation (FM), where neither the carriers nor the amplitude envelopes and the instantaneous frequencies (IFs) are known, and
therefore need to be estimated. Parametric representations of the
above kind can be classified on the basis of the number of components, as: 1) monocomponent or 2) multicomponent models.
This classification directly affects the number of envelopes and
IFs that need to be estimated, and finally the demodulation technique that has to be used.
The theory of the monocomponent representation is well
established, and a large number of demodulation techniques
based on different approaches have been developed in the last
decade. The prominence attained by the Teager-Kaiser operator

Manuscript received August 25, 2005; revised August 30, 2006. The associate
editor coordinating the review of this manuscript and approving it for publication was Dr. Rainer Martin.
The authors are with the Dipartimento di Elettronica, Intelligenza Artificiale
e Telecomunicazioni (DEIT), Università Politecnica delle Marche, I-60131
Ancona, Italy (e-mail: f.gianfelici@deit.univpm.it; g.biagetti@deit.univpm.it;
pcrippa@deit.univpm.it; turchetti@deit.univpm.it).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2006.889744

[2] and the Hilbert transform [3] makes them the current
best approaches. Both techniques use ad hoc (low-pass and
bandpass) filters in order to regularize the large variations that
affect estimation of both the envelope and the IF of the signal.
In fact, these techniques perform well in stationary signal
modeling (e.g., in the modeling of artificially synthesized
signals) where parameter variations are of limited entity, but
they are not adequate for nonstationary signals (such as speech
signals), mainly because of the large and fast excursions of the
pitch period (inverse of fundamental frequency) that typically
occur in such signals. The performance of the aforementioned
techniques can be enhanced, at least on some signal subparts,
by accurately filtering and windowing the signal before the
demodulation process. Thus, the approximate nature of parameter extraction in sinusoidal modeling does not permit an
exact reconstruction of the original signal. This aspect causes
considerable difficulty in the modeling of signal subparts such
as transitions between phonemes, where the nonstationary
nature of signals determines large variations in signal dynamics
(attacks and closures), variations that in turn produce some
well-known undesired phenomena such as pre-echoes and
distortions. Additionally, model parameters are highly sensitive
to frame segmentation, so that when long frames are used, the
time resolution is inadequate for capturing signal dynamics
such as attack transients. On the other hand, when short frames
are used, the degradation affects frequency resolution. In both
cases, estimation of sinusoidal components becomes difficult,
as stated by Goodwin in [4]. Therefore, the assumption on
which the sinusoidal model is based, that is, model parameters
are slowly time-varying quantities, is difficult to satisfy in every
frame or in transitions between adjacent ones [5].
The above limitations have driven the development of models that generalize the Quatieri-McAulay model; these are usually based on mixed approaches such as the exponential sinusoidal model (ESM) [6]-[8], exponentially damped sinusoids (EDSs) [9], damped delayed sinusoids (DDSs) [10], [11], and partial damped and delayed sinusoids (PDDSs) [5].
In these cases, model parameters are estimated by means of approximate techniques, which allow the control of the modeling error and, under adequate conditions, a multicomponent characterization of signals. The limitations of the Quatieri-McAulay model, previously described for the monocomponent case, also exist for the multicomponent case.
The great interest in the theory of multicomponent modeling
and the absence of a rigorous closed-form formulation of this
problem in fact represent the starting point for the formalization and development of specific approaches and suitable techniques. An accurate description of the currently adopted best
approaches is proposed in [12]. A recent development of one

of these techniques can be found in [13], where the parameters of the sinusoidal components are estimated by means of
likelihood maximization over the windowed signal. Moreover,
Huang et al. [14] proposed a different analysis technique suited
for nonlinear and nonstationary data, the empirical mode decomposition (EMD), which is based on the iterative extraction
of intrinsic modes, followed by the application of the Hilbert
transform to compute a spectrum from them. This iterative technique has been applied to many different application fields, such
as seismology, oceanography, and the processing of biological
data, but its applicability to the speech processing area has not
yet received much attention in the literature. Another interesting
iterative approach [15] identifies pure sinusoids immersed in
noise by means of iterated filtering.
In this paper, we present an innovative iterative approach to
compute an asymptotically exact multicomponent sinusoidal
model of speech signals, based on the iterated application of
the Hilbert transform to a filtered version of the amplitude
envelope, and on the exact computation of these amplitudes
and associated phases. The algorithm can be applied to signals
without limitations on their duration, the number of components
to be extracted, and the desired modeling accuracy both for
stationary and transient signal portions. Finally, for the purpose
of completing the FM decomposition, an a posteriori adaptive
segmentation algorithm is used to extract arbitrarily accurate
instantaneous frequencies from the phase signals previously
obtained.
This paper is organized as follows. In Section II, a brief presentation of monocomponent and multicomponent sinusoidal
models, their extensions, and their associated extraction techniques, is given. In Section III, the mathematical formulation of
an iterative approach for accurate demodulation of multicomponent AM-FM signals, based on an iterative application of the
Hilbert transform, is presented. In Section IV, an adaptive segmentation algorithm based on linear regression, for piecewiseconstant IF calculus, is introduced. In Section V, the proposed
technique is applied to synthetic signals in order to demonstrate
its behavior and to compare its performance with several other
methods. In Section VI, some examples of the application of the
proposed technique to natural speech signals are shown. Finally,
Section VII concludes this paper.
II. SINUSOIDAL MODELING
A sine wave, whose amplitude and instantaneous phase are
time-varying quantities, can be considered as a monocomponent
AM-FM signal. Although this representation is in principle able
to represent arbitrary signals, it will be unsatisfactory when the
source is known to contain a mixture of components. In this
case, it is more suitable to consider a model composed of the
superposition of signals of this kind. This gives rise to what is
generally known as a multicomponent AM-FM signal.
In the following paragraphs, a brief résumé of both mono- and multicomponent modeling techniques is given.
A. Monocomponent Model
A monocomponent AM-FM signal is a sine wave defined by

x(t) = a(t) cos φ(t)    (1)

where a(t) and φ(t) represent the amplitude and the instantaneous phase, respectively. It is worth noting that the derivative of φ(t) represents the IF ω(t) = dφ(t)/dt.
The best performing techniques for estimating the AM and
the FM modulating signals are based on: 1) the Teager energy
operator [16] and 2) the Hilbert transform [3].
The first technique, called the discrete energy separation algorithm (DESA), is a nonlinear differential approach best suited
for narrowband signals, although it can be generalized to large
frequency deviations as has recently been proposed [17].
Its mathematical formulation is defined in the discrete time domain, according to the notation used in [17], as

Ψ[x(n)] = x²(n) − x(n+1) x(n−1)    (2)

where the derivative operation, which takes part in the Teager-Kaiser energy operator, is approximated by the symmetric difference. In this case, the parameters of the discrete-time AM-FM signal x(n), i.e., the envelope |a(n)| and the IF Ω(n), are calculated as in [16]

|a(n)| ≈ 2 Ψ[x(n)] / √(Ψ[x(n+1) − x(n−1)])    (3)

and

Ω(n) ≈ (1/2) arccos(1 − Ψ[x(n+1) − x(n−1)] / (2 Ψ[x(n)])).    (4)
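As an illustration of the energy-separation route, a minimal discrete-time sketch is given below; it assumes the symmetric-difference (DESA-2-style) formulas above, and the sampling rate and test tone are arbitrary choices, not values from the paper.

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n+1)*x(n-1)."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[2:] * x[:-2]
    return psi

def desa2(x):
    """Envelope and IF (radians/sample) estimates via energy separation."""
    eps = 1e-12                                   # guard against division by zero
    psi_x = teager_kaiser(x)
    y = np.zeros_like(x)
    y[1:-1] = x[2:] - x[:-2]                      # symmetric difference x(n+1) - x(n-1)
    psi_y = teager_kaiser(y)
    omega = 0.5 * np.arccos(np.clip(1 - psi_y / (2 * psi_x + eps), -1.0, 1.0))
    amp = 2 * psi_x / (np.sqrt(np.abs(psi_y)) + eps)
    return amp, omega

# illustrative use: a slowly AM-modulated 400 Hz tone sampled at 8 kHz
fs = 8000
t = np.arange(0, 0.1, 1 / fs)
x = (1 + 0.3 * np.cos(2 * np.pi * 5 * t)) * np.cos(2 * np.pi * 400 * t)
amp, omega = desa2(x)
print(np.median(omega[2:-2]) * fs / (2 * np.pi))  # close to the 400 Hz carrier
```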
The second technique is based on the Gabor analytic representation of the signal x(t), which makes use of the Hilbert transform [3]. Namely, let z(t) be the complex signal defined as

z(t) = x(t) + j x̂(t)    (5)

where the quadrature signal x̂(t) is the Hilbert transform of x(t). z(t) can be equivalently defined through the Fourier transform (FT) as

Z(f) = [1 + sgn(f)] X(f)    (6)

with X(f) and Z(f) being the FTs of x(t) and z(t), respectively. In this case, the envelope a(t) and the IF ω(t) are given by

a(t) = |z(t)| = √(x²(t) + x̂²(t))    (7)

and

ω(t) = d/dt arg z(t) = d/dt arctan(x̂(t)/x(t)).    (8)
Both the above techniques require suitable low-pass and/or
bandpass filters to reduce the large variations in the FM parameters that would arise from direct application of these parameter extraction algorithms. An accurate comparison between algorithms using these two approaches, such as the energy operator separation algorithm (EOSA), the smoothed energy operator separation algorithm (SEOSA), and the Hilbert transform
separation algorithm (HTSA), can be found in [18].
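For comparison, the Hilbert-transform route of (5)-(8) can be sketched in a few lines of Python (the sampling rate and the test tone are arbitrary choices):

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000                                        # sampling rate (Hz), illustrative
t = np.arange(0, 0.5, 1 / fs)
x = (1 + 0.3 * np.cos(2 * np.pi * 3 * t)) * np.cos(2 * np.pi * 440 * t)

z = hilbert(x)                                   # Gabor analytic signal z(t) = x(t) + j*x_hat(t)
envelope = np.abs(z)                             # a(t) as in (7)
phase = np.unwrap(np.angle(z))                   # phi(t)
inst_freq = np.gradient(phase, t) / (2 * np.pi)  # IF in Hz, as in (8)
print(np.median(inst_freq))                      # should sit near the 440 Hz carrier
```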


B. Multicomponent Model
A multicomponent AM-FM sinusoidal model [12] of the signal x(t) can be represented as

x(t) = Σ_{k=1}^{K} a_k(t) cos[2π f_k t + φ_k(t)],  0 ≤ t ≤ T    (9)

where T is the total signal duration, K is the number of components, a_k(t) is the amplitude envelope, f_k is the center frequency (or frequency centroid), and φ_k(t) is the instantaneous phase of the kth component, which is also called a resonance [18]. Generally, a_k(t) is required to be a slowly time-varying signal, and f_k should be constant in the time domain. Because it is impractical to consider f_k as constant over the full signal length, using a slowly varying or piecewise-constant function is generally accepted. For this purpose, the total time span [0, T] is divided into M intervals T_m, m = 1, …, M, and a constant frequency centroid is defined in each of them, so that (9) can be rewritten as

x(t) = Σ_{k=1}^{K} a_k(t) cos[2π f_{k,m} t + φ_k(t)],  t ∈ T_m    (10)

with

[0, T] = T_1 ∪ T_2 ∪ … ∪ T_M    (11)

where the f_{k,m} are the center frequencies inside the intervals T_m. These intervals can generally have different lengths, and depending on how they are obtained (i.e., by fixed-length windowing, by segmentation algorithms, or with a combination of these techniques) different relationships exist between their lengths. As a direct consequence of this characterization, in nonstationary signals, such as speech signals, a smooth behavior of the modeled signal in transitions between interval boundaries cannot be guaranteed.
The multicomponent sinusoidal models inherit the limitations
of the monocomponent models they can be considered a generalization of. Additionally, the need to identify the components
in the frequency domain in a systematic way makes the multicomponent models more complex. Generally, the effectiveness
of these models is related to the type of signal, its properties, the
required accuracy, and the validity of nonobvious assumptions.
According to the characterization proposed in [12], the vast
majority of demodulation methodologies for a multicomponent
AM-FM signal are based on the following techniques:
1) state space estimation;
2) Hankel and Toeplitz matrices;
3) linear prediction;
4) energy demodulation;
5) maximum-likelihood estimation.
Additionally, a structure based on a phase-locked loop (PLL) can be used to separate two components, as in the work of Bar-Ness et al. [19], where two pairs of in-phase and quadrature signals are used to extract the amplitude envelopes of the components, and feedback loops progressively correct each other's component estimation.
These techniques generally provide only an approximate reconstruction of the original signal. The main limitations derive from the nonexact nature of parameter extraction and the restrictions imposed by the assumptions on which the solutions are based. Letting x̂(t) be the estimation of the original signal x(t), it is thus possible to write

x̂(t) = Σ_{k=1}^{K} â_k(t) cos[2π f̂_k t + φ̂_k(t)]    (12)

where â_k(t), f̂_k, and φ̂_k(t) are the estimated parameters. The residue, or the modeling error, is

r(t) = x(t) − x̂(t).    (13)
It is generally well known that these modeling errors are critical as they cause the appearance of pre-echoes and distortions.
On the one hand, r(t) contains information about events that are localized in the time domain¹ and are critical in the modeling because they are not taken into account by the parameter extraction techniques. On the other hand, reducing the modeling error r(t)
requires the windowing to be done on a large timescale for stationary parts and a small timescale for transitions. These windows are determined before (and independently of) parameter
extraction, and they are generally based on empirical techniques
[4]. The empirical nature of these techniques does not guarantee
the satisfaction of the sinusoidal model's basic assumption, i.e.,
the slow variability inside frames and during transients. In order
to overcome these limitations, generalizations of the sinusoidal
model have been developed, such as the ESM, the exponentially
damped sinusoids (EDS), the DDS [10], [11], and the PDDS [5].
These techniques, after performing a segmentation of the signal
into very short time intervals, model the large variations in the
signal dynamics by means of exponential functions of the time.
The increase in the number of parameters inevitably degrades the accuracy of their estimation. Therefore, the effectiveness of the model would be limited if used for the sole purpose of improving parameter estimation during transitions.
III. MULTICOMPONENT ASYMPTOTICALLY EXACT AM-FM DECOMPOSITION
With the above considerations in mind, it can be stated that
a multicomponent model ought to be extracted by using an
approach that is able to determine the parameters step by step,
allowing an exact signal reconstruction after each step, and
without any constraint on the characteristics of the signal to be
decomposed. In this section, we present an iterative decomposition method [21], developed to suit these requirements, and
show how the modeling error rapidly vanishes as the iteration
proceeds.
¹It is well known that signal transformations in the time domain are very critical and that the arbitrary elimination of components, even those with low energy, could impair the intelligibility of the original signal [20].


Fig. 1. Terms a_0(t) (thin line) and ā_0(t) (thick line) for the sample speech signal, the Italian word "settimana", that will be used next. The term ã_0(t) can be obtained as the difference between a_0(t) and ā_0(t) according to the filter formulation.

A. Mathematical Formulation

Let x(t) be a generic speech signal. By virtue of (5) it can be written as

x(t) = Re[z(t)] = a(t) cos φ(t)    (14)

where Re[·] denotes the real part of a complex number, a(t) = |z(t)|, and φ(t) = arg z(t). Our aim here is to derive a multicomponent decomposition of x(t) by means of iterated applications of representations like (14) to the amplitude component. Let us denote by the index j the generic terms corresponding to those in (14) for the jth iteration of the decomposition procedure. In the case of speech signals, and in general for every signal that contains a mixture of sinusoids, the amplitude a_0(t) of the Gabor signal z_0(t) exhibits an oscillating behavior. Of course, since a_0(t) is always nonnegative, these oscillations do not occur around the origin, and a_0(t) is thus unsuited for direct treatment with further Hilbert transforms. Before performing another transform, it is in fact necessary to separate its trend ā_0(t) [22] from the alternating component ã_0(t), by means of a suitable adaptive filtering algorithm acting upon a_0(t) itself, so that a_0(t) = ā_0(t) + ã_0(t) (as is depicted in Fig. 1) and the residual ã_0(t) is a zero-mean oscillating signal that can thus be iteratively decomposed as in (14). The filter used to obtain the residual can be defined in several different ways, but to guarantee convergence it must behave as a high-pass filter designed so that only a fraction ε of the total signal energy is kept in the alternating component, as will be shown next.

Formally, starting with x_0(t) = x(t) as the first step of this iterative algorithm, we can write

x(t) = a_0(t) cos φ_0(t) = [ā_0(t) + ã_0(t)] cos φ_0(t)    (15)

where a_0(t) = |z_0(t)| and φ_0(t) = arg z_0(t). By denoting with z̃_0(t) = ã_0(t) + j H[ã_0(t)] the (complex) Gabor signal associated with the alternating component, it is possible to proceed with the decomposition by using the relations

a_1(t) = |z̃_0(t)|,  φ_1(t) = arg z̃_0(t)    (16)

so that it results in

ã_0(t) = a_1(t) cos φ_1(t)    (17)

which, once placed inside of (15), yields

x(t) = ā_0(t) cos φ_0(t) + a_1(t) cos φ_1(t) cos φ_0(t).    (18)

Having made use of the Werner trigonometric formula for the cosine product, we thus obtain

x(t) = ā_0(t) cos ψ_1(t) + (a_1(t)/2)[cos ψ_2(t) + cos ψ_3(t)]    (19)

where

ψ_1(t) = φ_0(t),  ψ_2(t) = φ_0(t) + φ_1(t),  ψ_3(t) = φ_0(t) − φ_1(t).    (20)

A detailed description of the subsequent steps will be given in the Appendix, where it will be shown that the number of components increases geometrically with the number of iterations. Letting this latter number of iterations be N, and using the generalization of (17)

ã_j(t) = a_{j+1}(t) cos φ_{j+1}(t),  j = 0, 1, …, N − 1    (21)

it results in

x(t) = Σ_{k=1}^{2^N − 1} A_k(t) cos ψ_k(t) + e_N(t).    (22)

This is a generalized multicomponent sinusoidal model, in which the phases ψ_k(t) can be iteratively computed as

ψ_{2k}(t) = ψ_k(t) + φ_{j+1}(t)    (23)

ψ_{2k+1}(t) = ψ_k(t) − φ_{j+1}(t)    (24)

for j = 0, 1, …, N − 1 and 2^j ≤ k ≤ 2^{j+1} − 1, starting from ψ_1(t) = φ_0(t).

A remarkable property holds for this signal representation, namely, as N increases, the last term in (22)

e_N(t) = ã_{N−1}(t) ∏_{i=0}^{N−1} cos φ_i(t)    (25)

rapidly vanishes. To prove this asymptotic behavior, since ã_j(t) is the high-pass filtered counterpart of a_j(t), by means of the Parseval equality

‖ã_j(t)‖² = ∫ |Ã_j(f)|² df    (26)


we can write

‖ã_j(t)‖² = ∫ |H(f)|² |A_j(f)|² df    (27)

where A_j(f) and Ã_j(f) are the Fourier transforms of the corresponding signals a_j(t) and ã_j(t), and H(f) is the transfer function of the high-pass filter. Let ε be the filter energy loss, defined as

ε = ‖ã_j(t)‖² / ‖a_j(t)‖².    (28)

For a given ε, it is therefore possible to adaptively design the filter so that its transfer function H(f) retains at most a fraction ε of the signal energy. We can thus write

‖ã_j(t)‖² ≤ ε ‖a_j(t)‖².    (29)

Since the Hilbert transform preserves energy, the Gabor signal energy will be twice that of the original signal, i.e.,

‖z̃_j(t)‖² = 2 ‖ã_j(t)‖²    (30)

so that

‖a_{j+1}(t)‖² = ‖z̃_j(t)‖² ≤ 2ε ‖a_j(t)‖²    (31)

hence, provided that 2ε < 1, ‖a_j(t)‖² decreases at least geometrically with j. Finally, from (25)

‖e_N(t)‖² ≤ ‖ã_{N−1}(t)‖² ≤ ε (2ε)^{N−1} ‖a_0(t)‖²    (32)

it results in

lim_{N→∞} ‖e_N(t)‖ = 0.    (33)

Fig. 2. Scheme of the complete envelope and phase extraction algorithm.

Having shown that e_N(t) asymptotically vanishes, (22) can be rewritten as

x(t) = Σ_{k=1}^{∞} A_k(t) cos ψ_k(t)    (34)

where

A_k(t) = ā_j(t) / 2^j,  2^j ≤ k ≤ 2^{j+1} − 1    (35)

and

ψ_k(t) = φ_0(t) ± φ_1(t) ± … ± φ_j(t),  2^j ≤ k ≤ 2^{j+1} − 1    (36)

as given by (23) and (24). Equation (34) is, by virtue of (33), an asymptotically exact decomposition of the signal x(t) in terms of amplitude and phase envelopes. In Section VI, the convergence behavior will be discussed in more depth, and some examples of the truncation error as a function of N, showing that good approximations are achieved even with low values of N, will be reported.

B. Algorithmic Formulation

A sketch of the algorithm flow for the implementation of the


signal representation (34) is depicted in Fig. 2. The basic iteration computes amplitude and phase of the Gabor signal, obtained through Hilbert transformation, and then decomposes the
amplitude by filtering, as the sum of amplitude envelope (lowpass) and amplitude residue (high-pass). The latter must be further decomposed iteratively. The extracted amplitude envelope
is ready to enter the model, while the (elementary) phase needs
to be combined with the previously extracted ones. Specifically,
the last extracted component needs to be added to all the possible linear combinations with ±1 coefficients of the previously extracted components in order to reflect the composition rule for cosines stated in (23) and (24).
After N iterations, the model is thus composed of 2^N − 1 parameter pairs, representing amplitudes and phases of the sinusoidal components. Of course, due to (23), (24), and (36), not all these parameters need to be separately computed or stored. The number of different amplitude envelopes is only N, since


a single envelope is added after each iteration. Similarly, only N elementary phases suffice to compute all the others.
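To make the iteration concrete, the following is a minimal sketch of the decomposition loop, under the assumption that a fixed low-order Butterworth low-pass can stand in for the adaptive trend filter; the cutoff and the function names are illustrative only, and with a fixed filter the energy condition of Section III-A (2ε < 1) is not guaranteed.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def iht_decompose(x, fs, n_iter=5, trend_cutoff_hz=50.0):
    """Iterated-Hilbert-transform style decomposition (sketch).

    Returns the elementary envelopes abar_j(t) and phases phi_j(t), plus the
    final amplitude residue, so that x(t) can be rebuilt as
    sum_j abar_j(t) * prod_{i<=j} cos(phi_i(t)) + residual term.
    """
    b, a = butter(4, trend_cutoff_hz, btype="low", fs=fs)  # trend (low-pass) filter
    envelopes, phases = [], []
    current = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        z = hilbert(current)                  # Gabor analytic signal of the current residue
        amp = np.abs(z)                       # a_j(t)
        phi = np.unwrap(np.angle(z))          # phi_j(t)
        trend = filtfilt(b, a, amp)           # abar_j(t), slowly varying part
        residue = amp - trend                 # atilde_j(t), zero-mean oscillation
        envelopes.append(trend)
        phases.append(phi)
        current = residue                     # decompose the residue at the next step
    return envelopes, phases, current

def iht_reconstruct(envelopes, phases):
    """Rebuild the signal from the elementary envelopes/phases (residue ignored)."""
    x_hat = np.zeros_like(envelopes[0])
    cos_prod = np.ones_like(envelopes[0])
    for abar, phi in zip(envelopes, phases):
        cos_prod = cos_prod * np.cos(phi)     # prod_{i<=j} cos(phi_i(t))
        x_hat += abar * cos_prod
    return x_hat
```

A single iteration already reproduces the monocomponent Hilbert demodulation of Section II-A; each further iteration adds one envelope and one elementary phase, from which the 2^N − 1 model components can be formed.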
IV. ADAPTIVE SEGMENTATION FOR INSTANTANEOUS
FREQUENCY CALCULATION
This section shows how to obtain a decomposition in terms
of instantaneous frequencies from the phase envelopes derived
in Section III.
The model stated in (34)

x(t) = Σ_{k=1}^{∞} A_k(t) cos ψ_k(t)    (37)

is actually an amplitude-phase modulation which can easily be converted to an AM-FM model by letting

ν_k(t) = (1/2π) dψ_k(t)/dt    (38)

and

ν_k(t) = ν̂_k(t) + η_k(t)    (39)

with η_k(t) being the modeling error, and ν̂_k(t) the instantaneous frequency estimate.
Several methods have been proposed in the literature to estimate ν̂_k(t), and most try to extract instantaneous frequencies that are constant in the time domain (stationary condition) or slowly time-varying (semistationary condition).
In order to better satisfy these conditions, the currently
employed techniques split the signal into short intervals, to
exploit the semistationary nature of speech frames. Nevertheless, most of these IF estimation techniques, such as short-time
Fourier transform (STFT) [23], multiband demodulation analysis (time-varying Gabor filterbank) [24], peak tracking of
short time spectra [1], matching pursuit technique [25], and
instantaneous frequency attractors [26], are very sensitive to the
segmentation method they adopt (windowing, frame division).
To alleviate this problem, several adaptive segmentation techniques, operating a subdivision of the signal into intervals over
which all the sinusoidal model parameters are to be estimated,
have recently been developed [4], [27]-[29].
However, the absence of a direct connection between segmentation and IF extraction necessarily undermines the achievable
accuracy.
In the approach presented here, amplitude and phase envelopes can be computed without the need for segmentation.
Instead, segmentation is used to extract IFs from the phase
envelopes so that a much simpler algorithm to be applied a
posteriori to the unwrapped phases of the signal would suffice.
We assume ν̂_k(t) to be piecewise constant in a set of time intervals which are adaptively estimated a posteriori, in order to satisfy an upper-bound error

|ν_k(t) − ν̂_k(t)| ≤ δ    (40)

with δ being the desired accuracy.

Fig. 3. Sketch of the adaptive segmentation algorithm for IF extraction.

From (40), the adaptive segmentation problem can be stated as the problem of finding a set of disjoint time spans that cover the whole [0, T] interval, so that ν̂_k(t) is constant in each of them and (40) holds.
The proposed algorithm is sketched in Fig. 3 and is based on the search for an appropriate segmentation that satisfies the above requirements. With the unwrapped phase ψ_k(t) being typically a noisy signal, the value that the piecewise-constant function ν̂_k(t) assumes in each interval can be estimated over finite intervals as the linear regression of the data provided by ψ_k(t), for linear regression is known to be a robust technique for computing instantaneous frequencies from noisy phase signals.
The algorithm starts from the beginning of the signal and computes the derivative of the unwrapped phase by means of linear regression over a small initial interval, progressively enlarging it until the condition (40) is no longer met. The largest interval that satisfied the condition is recorded as one of the segmentation intervals, then the starting point is advanced to the end of that interval, and the procedure is iterated until the end of the signal is reached.
The result obtained in this way is an estimation of the IF ν̂_k(t), and the modeling error η_k(t) can be made arbitrarily small. The accuracy is of course directly related to the number of intervals produced, and the algorithm easily allows the introduction of a signal-dependent bound δ(t) to take into account phenomena like pre-echoes and distortions.
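A possible rendering of this greedy a-posteriori segmentation is sketched below; the tolerance, the minimum interval length, and the growth step are illustrative parameters, not values from the paper.

```python
import numpy as np

def fit_if(phase, t, a, b):
    """IF estimate (Hz) over samples [a, b) via linear regression of the unwrapped phase."""
    slope = np.polyfit(t[a:b], phase[a:b], 1)[0]   # regression slope, rad/s
    return slope / (2 * np.pi)

def segment_if(phase, fs, tol_hz=2.0, min_len=32, step=16):
    """Greedy a-posteriori segmentation of an unwrapped phase into intervals
    over which a piecewise-constant IF stays within tol_hz of the local IF."""
    n = len(phase)
    t = np.arange(n) / fs
    segments = []                                  # (start_idx, end_idx, if_hz)
    start = 0
    while start < n - 1:
        end = min(start + min_len, n)
        seg_if = fit_if(phase, t, start, end)
        while end < n:                             # grow the interval while (40) holds
            trial = min(end + step, n)
            trial_if = fit_if(phase, t, start, trial)
            local_if = np.gradient(phase[start:trial], t[start:trial]) / (2 * np.pi)
            if np.max(np.abs(local_if - trial_if)) > tol_hz:
                break                              # bound violated: keep the previous end
            end, seg_if = trial, trial_if
        segments.append((start, end, seg_if))
        start = end
    return segments
```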

V. MODEL APPLICATION TO SYNTHETIC SIGNALS AND COMPARISON WITH THE STATE OF THE ART
In this section, the behavior of the proposed modeling technique, based on the iterated Hilbert transform (IHT), is analyzed


Fig. 5. Amplitudes of the two demodulated chirp components.

Fig. 4. (a) Synthetic signal given by (41) used for testing the decomposition algorithm. (b) The first two amplitude envelopes ā_0(t) (solid line) and ā_1(t) (dashed line). (c) The corresponding unwrapped phases φ_0(t) (solid) and φ_1(t) (dashed).

with applications to a few synthetic signals, which were chosen


to validate its effectiveness. Several comparisons of IHT performance with the state of the art are also described in this section.
Empirical mode decomposition (EMD), periodic algebraic separation and energy demodulation (PASED), and multiband energy separation algorithm (MESA) were considered for this purpose. The decomposition capabilities of these techniques have
been investigated using the same synthetic signals, composed
of two components, with and without additive Gaussian noise
superposed on them. Moreover, the convergence properties and
processing times of the first two techniques, i.e., IHT and EMD,
which sequentially extract signal components by means of iterative algorithms, were also analyzed.
Let us consider the two-component AM synthetic signal

x(t) = a_1(t) cos(2π f_1 t) + a_2(t) cos(2π f_2 t)    (41)

with

a_i(t) = A_i [1 + cos(2π f_m t)],  i = 1, 2    (42)

where f_1 = 500 Hz, f_2 = 550 Hz, and f_m is the frequency of the modulating envelopes. This signal is shown in Fig. 4 along with the first two amplitude envelopes and the corresponding phases as extracted by the IHT algorithm. It is easy to note that the two amplitude envelopes accurately match the original modulating signals a_1(t) and a_2(t). The phase curves reported in Fig. 4(c) correspond to φ_0(t) (solid line) and φ_1(t) (dashed line), and they have a mean slope of 500.09 and 49.97 Hz, respectively. The first slope corresponds exactly to the carrier frequency f_1, while the latter needs to be combined with the former (added, in this case) to obtain the carrier frequency f_2, as already explained in Section III-A.
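A test signal of this kind can be generated as follows; the modulating frequency, the amplitudes, and the modulation depth are illustrative assumptions, since only the two carrier frequencies are given by the comparison tables.

```python
import numpy as np

fs = 8000                          # sampling rate (Hz), assumed
t = np.arange(0, 1.0, 1 / fs)
f1, f2 = 500.0, 550.0              # carrier frequencies from the comparison tables
fm = 5.0                           # AM modulating frequency, illustrative value
A1, A2 = 1.0, 0.5                  # component amplitudes, illustrative values

a1 = A1 * (1 + np.cos(2 * np.pi * fm * t))
a2 = A2 * (1 + np.cos(2 * np.pi * fm * t))
x = a1 * np.cos(2 * np.pi * f1 * t) + a2 * np.cos(2 * np.pi * f2 * t)

# x can then be fed to the decomposition sketched in Section III, e.g.:
# envelopes, phases, _ = iht_decompose(x, fs, n_iter=2, trend_cutoff_hz=25.0)
```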
The frequency separation used in the above example is 10%. To highlight how the algorithm behaves as the frequencies of the components become closer, two crossing chirp signals, c_1(t) and c_2(t), were considered. The resulting synthetic signal x(t) = c_1(t) + c_2(t) is composed of two sinusoids whose frequencies vary linearly with time and cross each other, while their amplitudes are held constant. In formulas
x(t) = c_1(t) + c_2(t)    (43)

with

c_1(t) = A_1 cos{2π [f_lo t + ((f_hi − f_lo)/(2 T_s)) t²]}    (44)

c_2(t) = A_2 cos{2π [f_hi t − ((f_hi − f_lo)/(2 T_s)) t²]}    (45)

where A_1 and A_2 are the constant amplitudes, f_lo and f_hi (given in Hz and kHz, respectively) are the endpoints of the linear frequency sweeps, and T_s is the sweep duration.
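The crossing-chirp test can be generated, for instance, with scipy.signal.chirp; the sweep endpoints and duration are illustrative, while the 1-to-8 amplitude ratio follows the description of Fig. 5.

```python
import numpy as np
from scipy.signal import chirp

fs, T = 8000, 1.0                      # sampling rate and sweep duration, assumed
t = np.arange(0, T, 1 / fs)
f_lo, f_hi = 100.0, 1000.0             # sweep endpoints, illustrative values
A1, A2 = 1.0, 0.125                    # 1-to-8 amplitude ratio mentioned in the text

c1 = A1 * chirp(t, f0=f_lo, t1=T, f1=f_hi, method="linear")   # rising chirp
c2 = A2 * chirp(t, f0=f_hi, t1=T, f1=f_lo, method="linear")   # falling chirp
x = c1 + c2                            # crossing-chirp test signal of (43)
```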
Figs. 5 and 6 show the algorithm capabilities in separating amplitudes and IFs, respectively. As the component frequencies vary (increase/decrease) between the two sweep endpoints, the 1-to-8 ratio (18 dB) between the two amplitudes is correctly recognized and the components are well separated, except for a small region near the crossing instant, as can be seen in Fig. 5. Fig. 6 shows the IFs ν_1(t), ν_2(t), and ν_3(t) of the demodulated components. Dashed straight lines represent the original frequencies of the two chirps c_1(t) and c_2(t). As can be seen, the demodulated components closely follow the chirps also during the crossover, although there is a discontinuity in the labeling at the intersection point. In fact, since the IFs were obtained by combining the extracted phases φ_0(t) and φ_1(t), there is one degree of freedom in the labeling of the IFs, although the labels need to be switched from one IF to the other at the crossover point when trying to track the smaller of the two chirp components. It is important to note that this problem is also common to other algorithms. Indeed, among those considered for the comparison, only PASED does not have this problem and is able to track the components also across the intersection point.


TABLE I
PERFORMANCE COMPARISON OF VARIOUS AM-FM METHODS APPLIED TO NOISELESS SIGNALS (f_1 = 500 Hz, f_2 = 550 Hz; SIGNAL-TO-NOISE RATIOS SNR_1 AND SNR_2 ARE MEASURED IN DECIBELS)

TABLE II
PERFORMANCE COMPARISON OF VARIOUS AM-FM METHODS APPLIED TO NOISY SIGNALS (f_1 = 500 Hz, f_2 = 550 Hz; SIGNAL-TO-NOISE RATIOS SNR_1 AND SNR_2 ARE MEASURED IN DECIBELS)

In order to better appreciate the validity of the proposed IHT-based approach, a comparison with the performance of EMD, MESA, and PASED, for noise-free and noisy signals, was carried out. As a reference implementation of EMD, Rilling's algorithm [30] was used, which is, to the best of the authors' knowledge, one of the best optimized implementations available. MESA was implemented by means of an ad hoc Gaussian filter bank followed by the standard DESA demodulator borrowed from the PASED algorithm. Finally, the implementation of the PASED algorithm itself was based on Santhanam's original code [12], with the addition of Gaussian filters used to smooth the signal between the algebraic decomposition and energy separation blocks.
The filters used in the MESA and PASED algorithms were centered around the known carrier frequencies, by providing the true values of these, and their bandwidths were selected so that their frequency responses cross each other at the half-peak height, in order to satisfy the optimality criteria described in [31]. It is worth noting that the EMD and IHT algorithms do not require this care in the choice of filters.
A series of two-component AM synthetic signals of the same kind as the one described in (41) was used for this comparison, obtained by varying the amplitude ratio A_2/A_1 between 0.1 and 500. For the two components, the signal-to-noise ratios SNR_1 and SNR_2 were defined as

SNR_i = 10 log_10 ( ‖a_i(t)‖² / ‖a_i(t) − A_k(t)‖² ),  i = 1, 2    (46)

Fig. 6. Frequencies ν_1(t), ν_2(t), and ν_3(t) of the demodulated chirp components. The dashed straight lines are the frequencies of the two chirps.

where k is the index of the extracted component that corresponds to the original component a_i(t). The resulting SNRs are reported in Table I for the four AM-FM methods IHT, EMD, MESA, and PASED, as functions of the amplitude ratio. The same comparison was carried out for noise-corrupted signals, obtained by adding a Gaussian noise whose power is one tenth of the signal power (−10 dB) to the second component. These results are shown in Table II.
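Under the reading of (46) given above (an energy ratio between the reference component and the extraction error), a small helper could be:

```python
import numpy as np

def component_snr_db(true_comp, extracted_comp):
    """Energy ratio (dB) between a reference component and its extraction error;
    one plausible reading of the per-component SNR used in the comparison."""
    err = np.asarray(true_comp) - np.asarray(extracted_comp)
    return 10 * np.log10(np.sum(np.asarray(true_comp) ** 2) / np.sum(err ** 2))
```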
As can be seen from Tables I and II, the IHT-based modeling
has higher performance than EMD, both in the extraction of AM


Fig. 7. SNR of AM demodulated components as a function of the modulation index and amplitude ratio A_2/A_1 for the three algorithms IHT, MESA, and PASED.

components, and in noise rejection. MESA and PASED have


a lower performance than IHT when one component is much
stronger than the other, in both noiseless and noisy signals.
Another series of tests using two-component AM-FM signals was considered in order to verify the FM demodulation capabilities and the influence of FM on AM component extraction. The signals used are defined as

x(t) = s_1(t) + s_2(t)    (47)

with

s_1(t) = a_1(t) cos[2π f_1 t + β sin(2π f_FM t)]    (48)

s_2(t) = a_2(t) cos[2π f_2 t + β sin(2π f_FM t)]    (49)

where a_1(t), a_2(t), f_1, and f_2 are the same as in (41) and (42), β is the modulation index, and f_FM is the FM modulating frequency, which was held fixed. Both the AM SNR and the root mean square (rms) error of the estimated IFs are shown in Figs. 7 and 8, for IHT, MESA, and PASED.
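Such an AM-FM test signal can be synthesized along the following lines; the FM rate, modulation index, and envelope parameters are illustrative assumptions.

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
f1, f2, f_fm = 500.0, 550.0, 5.0        # carriers from the tables; FM rate illustrative
beta = 2.0                              # modulation index, illustrative
a1 = 1.0 + 0.5 * np.cos(2 * np.pi * 5.0 * t)
a2 = 0.5 + 0.25 * np.cos(2 * np.pi * 5.0 * t)

s1 = a1 * np.cos(2 * np.pi * f1 * t + beta * np.sin(2 * np.pi * f_fm * t))
s2 = a2 * np.cos(2 * np.pi * f2 * t + beta * np.sin(2 * np.pi * f_fm * t))
x = s1 + s2                             # two-component AM-FM test signal of (47)
```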
It is worth noting that IHT has a better SNR in the AM-extracted components than the other techniques for every considered modulation index, except for the case where the amplitudes of the two components are comparable, e.g., when A_2/A_1 is close to unity. In fact, it is well known that energy-based methods work better in this case. Analogously, IHT has a lower rms error in the estimation of the IFs of the FM components, apart from the case of comparable amplitudes as previously stated.
Finally, Fig. 9 shows some direct comparisons between the
two sequential iterative algorithms. As it turns out, IHT outperforms EMD in terms of computation time, with a processing
time ratio between EMD and IHT that increases with signal
length. Moreover, Fig. 10, where the residual energy is reported
as a function of the number of iterations, clearly shows that the
asymptotical convergence of IHT is faster than that of EMD, regardless of the number of components.
VI. MODEL APPLICATION TO SPEECH SIGNALS
This section presents a few applications of the IHT algorithm
to speech signals of arbitrary length. The signals used are part
of the Italian portion of the Multext Prosodic Database [32],
which is an extract of the EUROM.1 speech corpus [33] and
contains utterances from ten Italian speakers of different sex,
age, and geographical origin, who recorded 15 sentences each
in an anechoic room, amounting to nearly 7000 words.
Fig. 11 shows the elementary amplitude envelopes ā_j(t) and phases φ_j(t), as obtained by applying the IHT algorithm with N = 5

Fig. 8. RMS error of demodulated IF components as a function of the modulation index and amplitude ratio A_2/A_1 for the three algorithms IHT, MESA, and PASED.

Fig. 10. Residual energy comparison. IHT versus EMD.


Fig. 9. Processing times of the IHT versus EMD.

to the Italian word "settimana" (which is pronounced /settima:na/ and means "week"). The phases vary slowly, and
their slopes appear quite similar, but it is easy to note that their

derivatives (whose mean values, in isolated vowels, represent


the speech formants) generate different center frequencies.
In order to test the validity of mean-IF extraction from the
slowly-varying phases, several isolated vowels extracted from


Fig. 13. Relative energy of the residual ‖r_N(t)‖/‖x(t)‖ as a function of the iteration number N.
Fig. 11. Italian word "settimana". (a) Original signal. (b) Its elementary amplitude envelopes ā_j(t). (c) Phases φ_j(t) (j = 0, …, 4).

Fig. 12. Results of the adaptive segmentation algorithm applied to φ_0(t). Vertical bars are interval boundaries.

the Multext Prosodic Database were considered. Experimental


results show a good accuracy in formant estimation.
Fig. 12 shows the results of the adaptive segmentation algorithm applied to the first extracted phase. This figure clearly
shows that the required accuracy determines a highly irregular
segmentation of the time axis, in contrast with what would have
happened with an a priori segmentation.
The asymptotical exactness of the proposed method arises from (33), but Fig. 13 empirically shows that convergence is reached even with low values of N. Here, the relative energy of the residual ‖r_N(t)‖/‖x(t)‖ is plotted as a function of the iteration number N, and it can be seen that 20 iterations suffice to give an error comparable to round-off noise. Moreover, on the basis of a subjective listening test performed with 40 people, it is possible to state that, with a small value of N, the model can be deemed equivalent to the original signal, with a relative residual error in the order of −30 dB.
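This convergence behavior can be checked empirically by reusing the decomposition sketch of Section III (it relies on iht_decompose and iht_reconstruct from that sketch; with the fixed trend filter assumed there, the residual may not fall all the way to round-off noise):

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.5, 1 / fs)
x = (1 + 0.4 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 300 * t) \
    + 0.3 * np.cos(2 * np.pi * 800 * t)          # stand-in for a speech frame

for n_iter in (1, 2, 5, 10, 20):
    envelopes, phases, _ = iht_decompose(x, fs, n_iter=n_iter)   # sketch from Sec. III
    residual = x - iht_reconstruct(envelopes, phases)
    rel_db = 10 * np.log10(np.sum(residual ** 2) / np.sum(x ** 2))
    print(f"N = {n_iter:2d}: relative residual energy = {rel_db:6.1f} dB")
```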
In order to clarify the relation between mathematical components and their physical meanings, the time-frequency analysis
based on spectrograms is shown in Figs. 14 and 15. In particular,
Fig. 14 shows the results for the Italian word settimana, where
it is easy to note that the shape of the time-frequency structure can be effectively approximated with a small number of AM-FM components, which progressively refine the spectrogram reconstruction. Nevertheless, the high-frequency components are not well reconstructed because of the heterodyning
effects of IHT.
It is worth noting that signals segmented at word level do not
have a direct and simple connection with speech resonances and
formants as happens for example, in simple speech signals (such
as vowels, etc.), because of complex coarticulation phenomena
in speech production and phonation. Bearing in mind this consideration, the time-frequency analysis based on spectrograms
was performed with a simpler signal to validate the applicability
in speech-signal modeling. Fig. 15 depicts the spectrograms of
the Italian sustained vowel /a/. As happened in the above case,
the progressive extraction of components refines the spectrogram structure, and the heterodyning effect of IHT causes higher-power components to be reconstructed before the lower-power
ones. Moreover, in this case it is clear that the formant structure
of the vowel is captured after the first few components.
Experimental results show the absence of limitations in
terms of signal length, time-frequency distribution, and so on.
Additionally, the power spectral density (PSD) of the Italian
vowel /e/ and of the Italian word settimana, rebuilt with a
varying number of components, was considered. The Euclidean
distances between the aforementioned approximations and the original PSDs are shown in Figs. 16 and 17, thus empirically
verifying the convergence. It is worth noting that a theoretical
proof of the PSD convergence can easily be obtained by means
of the Parseval equality and the property of asymptotical IHT
convergence.
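A PSD distance of this kind can be computed, for example, with Welch periodograms (the Welch estimator here is an illustrative choice; the paper does not specify its estimator) and the Euclidean norm:

```python
import numpy as np
from scipy.signal import welch

def psd_distance(x, x_hat, fs, nperseg=512):
    """Euclidean distance between the PSDs of the original and rebuilt signals."""
    _, pxx = welch(x, fs=fs, nperseg=nperseg)
    _, pyy = welch(x_hat, fs=fs, nperseg=nperseg)
    return np.linalg.norm(pxx - pyy)
```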


Fig. 14. Italian word "settimana". (a) Spectrogram. (b)-(f) Spectrograms of rebuilt signals with N = 1, …, 5, respectively.

VII. CONCLUSION

This paper presents an asymptotically exact multicomponent sinusoidal model that can be applied to implement an AM-FM decomposition of speech signals. The proposed approach is based on the iterated application of the Hilbert transform to amplitude envelopes obtained by adaptively low-pass filtering the Gabor signal amplitudes. Instantaneous frequencies were then obtained from the extracted phases by means of a simple linear regression over time intervals adaptively detected a posteriori.
Applications of the algorithm to synthetic signals and natural speech showed its effectiveness in both component extraction and speech modeling. A comparative evaluation with state-of-the-art techniques demonstrated the superiority of the proposed approach, without the need for complex optimizations like those required by other approaches.

APPENDIX

According to (14), it is possible to write the signal x(t) as

x(t) = a_0(t) cos φ_0(t)    (50)

where z_0(t) is defined as in (5), a_0(t) = |z_0(t)|, and φ_0(t) = arg z_0(t). Thanks to the filtering described in Section III-A, (50) can be rewritten as

x(t) = ā_0(t) cos φ_0(t) + ã_0(t) cos φ_0(t)    (51)

where ā_0(t) = a_0(t) − ã_0(t). Then, reapplying to ã_0(t) the equality expressed in (50), we have

ã_0(t) = a_1(t) cos φ_1(t)    (52)

where a_1(t) = |z̃_0(t)| and φ_1(t) = arg z̃_0(t), thus obtaining

x(t) = ā_0(t) cos ψ_1(t) + (a_1(t)/2)[cos ψ_2(t) + cos ψ_3(t)]    (53)

where

ψ_2(t) = φ_0(t) + φ_1(t)    (54)

ψ_3(t) = φ_0(t) − φ_1(t)    (55)

and ψ_1(t) = φ_0(t).


Fig. 15. Italian sustained vowel /a/. (a) Original signal. (b) Its spectrogram. (c)-(f) Spectrograms of rebuilt signals with N = 1, …, 4, respectively.

Fig. 16. PSD distance of the Italian vowel /e/ and its rebuilt signals as a function of the number of components.

Fig. 17. PSD distance of the Italian word "settimana" and its rebuilt signals as a function of the number of components.

Subsequently, a_1(t) is further decomposed by filtering so as to obtain

a_1(t) = ā_1(t) + ã_1(t)    (56)

i.e.,

x(t) = ā_0(t) cos ψ_1(t) + (ā_1(t)/2)[cos ψ_2(t) + cos ψ_3(t)] + (ã_1(t)/2)[cos ψ_2(t) + cos ψ_3(t)].    (57)


Reapplying to ã_1(t) the formulation proposed in (52), we have

ã_1(t) = a_2(t) cos φ_2(t)    (58)

where a_2(t) = |z̃_1(t)|, φ_2(t) = arg z̃_1(t), and z̃_1(t) is the Gabor signal associated with ã_1(t). Iterating the formulation previously proposed, we obtain

x(t) = ā_0(t) cos ψ_1(t) + (ā_1(t)/2)[cos ψ_2(t) + cos ψ_3(t)] + (a_2(t)/2) cos φ_2(t)[cos ψ_2(t) + cos ψ_3(t)]    (59)

and, expanding by reusing trigonometric formulas, it is possible to write

x(t) = ā_0(t) cos ψ_1(t) + (ā_1(t)/2)[cos ψ_2(t) + cos ψ_3(t)] + (a_2(t)/4)[cos ψ_4(t) + cos ψ_5(t) + cos ψ_6(t) + cos ψ_7(t)]    (60)

which can be rewritten and regrouped, that is, in compact form,

x(t) = Σ_{k=1}^{3} A_k(t) cos ψ_k(t) + e_2(t)    (61)

where A_1(t) = ā_0(t), A_2(t) = A_3(t) = ā_1(t)/2,

e_2(t) = (a_2(t)/4)[cos ψ_4(t) + cos ψ_5(t) + cos ψ_6(t) + cos ψ_7(t)]    (62)

and

ψ_4(t) = ψ_2(t) + φ_2(t)    (63)

ψ_5(t) = ψ_2(t) − φ_2(t)    (64)

ψ_6(t) = ψ_3(t) + φ_2(t)    (65)

ψ_7(t) = ψ_3(t) − φ_2(t).    (66)

With a further filtering operation we then have

a_2(t) = ā_2(t) + ã_2(t)    (67)

and it is possible to generalize (52) and (58) as

ã_j(t) = a_{j+1}(t) cos φ_{j+1}(t)    (68)

where

a_{j+1}(t) = |z̃_j(t)|    (69)

and

φ_{j+1}(t) = arg z̃_j(t)    (70)

for j = 0, 1, …, N − 1. Iterating the formulation previously proposed, we obtain

x(t) = Σ_{j=0}^{N−1} ā_j(t) ∏_{i=0}^{j} cos φ_i(t) + ã_{N−1}(t) ∏_{i=0}^{N−1} cos φ_i(t)    (71)

and, expanding each product of cosines according to the generalization of (54), (55), and (63)-(66), it is possible to write

x(t) = Σ_{j=0}^{N−1} (ā_j(t)/2^j) Σ_{k=2^j}^{2^{j+1}−1} cos ψ_k(t) + e_N(t)    (72)

that is, in compact form,

x(t) = Σ_{k=1}^{2^N − 1} A_k(t) cos ψ_k(t) + e_N(t)    (73)

thus obtaining the formulation proposed in (22).

ACKNOWLEDGMENT

The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments that helped improve this paper. They would also like to thank Prof. B. Santhanam for providing them with a reference implementation of his PASED demodulation algorithm.

REFERENCES

[1] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 4, pp. 744-754, Aug. 1986.
[2] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy separation in signal modulations with application to speech analysis," IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3024-3051, Oct. 1993.
[3] S. L. Hahn, Hilbert Transforms in Signal Processing. Boston, MA: Artech House, 1996.
[4] M. Goodwin, "Multiresolution sinusoidal modeling using adaptive segmentation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'98), Seattle, WA, May 1998, vol. 3, pp. 1525-1528.
[5] R. Boyer and K. Abed-Meraim, "Audio modeling based on delayed sinusoids," IEEE Trans. Speech Audio Process., vol. 12, no. 2, pp. 110-120, Mar. 2004.
[6] J. Jensen, S. H. Jensen, and E. Hansen, "Exponential sinusoidal modeling of transitional speech segments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'99), Phoenix, AZ, Mar. 1999, vol. 1, pp. 473-476.
[7] P. Lemmerling, I. Dologlou, and S. Van Huffel, "Speech compression based on exact modeling and structured total least norm optimization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'98), Seattle, WA, May 1998, vol. 1, pp. 353-356.
[8] S. Van Huffel, H. Park, and J. B. Rosen, "Formulation and solution of structured total least norm problems for parameter estimation," IEEE Trans. Signal Process., vol. 44, no. 10, pp. 2464-2474, Oct. 1996.
[9] K. Hermus, W. Verhelst, P. Lemmerling, P. Wambacq, and S. Van Huffel, "Perceptual audio modeling with exponentially damped sinusoids," Signal Process., vol. 85, no. 1, pp. 163-176, Jan. 2005.
[10] R. Boyer and K. Abed-Meraim, "Estimation of damped and delayed sinusoids: Algorithm and Cramér-Rao bound," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'03), Hong Kong, Apr. 2003, vol. 6, pp. 137-140.
[11] R. Boyer and K. Abed-Meraim, "Audio transients modeling by damped and delayed sinusoids (DDS)," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'02), Orlando, FL, May 2002, vol. 2, pp. 1729-1732.
[12] B. Santhanam and P. Maragos, "Multicomponent AM-FM demodulation via periodicity-based algebraic separation and energy-based demodulation," IEEE Trans. Commun., vol. 48, no. 3, pp. 473-490, Mar. 2000.
[13] S. Gazor and R. R. Far, "Adaptive maximum windowed likelihood multicomponent AM-FM signal decomposition," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 479-491, Mar. 2006.
[14] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. R. Soc. London A, vol. 454, no. 1971, pp. 903-995, Mar. 1998.
[15] T.-H. Li and B. Kedem, "Iterative filtering for multiple frequency estimation," IEEE Trans. Signal Process., vol. 42, no. 5, pp. 1120-1132, May 1994.
[16] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "On separating amplitude from frequency modulations using energy operators," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'92), San Francisco, CA, Mar. 1992, vol. 2, pp. 1-4.
[17] B. Santhanam, "Generalized energy demodulation for large frequency deviations and wideband signals," IEEE Signal Process. Lett., vol. 11, no. 3, pp. 341-344, Mar. 2004.
[18] A. Potamianos and P. Maragos, "A comparison of energy operators and the Hilbert transform approach to signal and speech demodulation," Signal Process., vol. 37, no. 1, pp. 95-120, May 1994.
[19] Y. Bar-Ness, F. Cassara, H. Schachter, and R. DiFazio, "Cross-coupled phase-locked loop with closed loop amplitude control," IEEE Trans. Commun., vol. COM-32, no. 2, pp. 195-199, Feb. 1984.
[20] D. O'Shaughnessy, Speech Communications: Human and Machine, 2nd ed. Piscataway, NJ: IEEE Press, 2000.
[21] F. Gianfelici, G. Biagetti, P. Crippa, and C. Turchetti, "Asymptotically exact AM-FM decomposition based on iterated Hilbert transform," in Proc. Interspeech 2005 - Eurospeech, 9th Eur. Conf. Speech Commun. Technol., Lisbon, Portugal, Sep. 2005, pp. 1121-1124.
[22] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. New York: Springer-Verlag, 1991.
[23] J. S. Marques and L. B. Almeida, "Frequency-varying modeling of speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 39, no. 5, pp. 763-765, May 1989.
[24] A. Potamianos and P. Maragos, "Speech analysis and synthesis using an AM-FM modulation model," Speech Commun., vol. 28, no. 3, pp. 195-209, Jul. 1999.
[25] Ç. Ö. Etemoğlu and V. Cuperman, "Matching pursuits sinusoidal speech coding," IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 413-424, Sep. 2003.
[26] T. Abe and M. Honda, "Sinusoidal model based on instantaneous frequency attractors," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'03), Hong Kong, Apr. 2003, vol. 6, pp. 133-136.
[27] P. Prandoni, M. Goodwin, and M. Vetterli, "Optimal time segmentation for signal modeling and compression," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'97), Munich, Germany, Apr. 1997, vol. 3, pp. 2029-2032.
[28] M. M. Goodwin and J. Laroche, "Audio segmentation by feature-space clustering using linear discriminant analysis and dynamic programming," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., New Paltz, NY, Oct. 2003, vol. 1, pp. 131-134.
[29] R. Vafin, R. Heusdens, S. van de Par, and W. B. Kleijn, "Improved modeling of audio signals by modifying transient locations," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., New Paltz, NY, Oct. 2001, vol. 1, pp. 143-146.
[30] G. Rilling, P. Flandrin, and P. Gonçalvès, "On empirical mode decomposition and its algorithms," in Proc. IEEE-EURASIP Workshop Nonlinear Signal Image Process., Grado, Italy, Jun. 2003.
[31] A. C. Bovik, P. Maragos, and T. F. Quatieri, "AM-FM energy detection and separation in noise using multiband energy operators," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3245-3265, Dec. 1993.
[32] E. Campione and J. Véronis, "A multilingual prosodic database," in Proc. 5th Int. Conf. Spoken Lang. Process. (ICSLP'98), Sydney, Australia, Dec. 1998, vol. 7, pp. 3163-3166.
[33] D. Chan, A. Fourcin, D. Gibbon, B. Granström, M. Huckvale, G. Kokkinakis, K. Kvale, L. Lamel, B. Lindberg, A. Moreno, J. Mouropoulos, F. Senia, I. Trancoso, C. Veld, and J. Zeiliger, "EUROM: A spoken language resource for the EU," in Proc. ESCA, 4th Eur. Conf. Speech Commun. Technol. (Eurospeech'95), Madrid, Spain, Sep. 1995, vol. 1, pp. 867-870.

Francesco Gianfelici was born in 1979. He received


the Laurea degree in electronics engineering from the Università Politecnica delle Marche, Ancona, Italy, in 2003. He is currently pursuing the Ph.D. degree in electronics, informatics, and telecommunications engineering in the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche.
He has been active in the areas of theoretical
computer science and information theory. His
current research interests include multicomponent
speech modeling based on AMFM parameters, signal and image processing,
recognition algorithms, and neural networks.

Giorgio Biagetti (S'03-M'05) received the Laurea


degree (summa cum laude) in electronics engineering
from the Università degli Studi di Ancona, Ancona, Italy, in 2000, and the Ph.D. degree in electronics and telecommunications engineering from the Università Politecnica delle Marche, Ancona, in 2004.
He is currently a Research Assistant at the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica
delle Marche. His research interests include statistical and high-level simulation of analog integrated
circuits, statistical modeling, coding, and synthesis of speech, and automatic
speech recognition.

Paolo Crippa (M'02) received the Laurea degree in


electronics engineering (summa cum laude) from the
Università degli Studi di Ancona, Ancona, Italy, in 1994 and the Ph.D. degree in electronics engineering from the Polytechnic of Bari, Bari, Italy, in 1999.
From 1994 to 1999, he was Research Fellow at the Department of Electronics, Università degli Studi di Ancona, where in 1999 he was appointed Research Assistant as a member of the Technical Staff. Since 2006, he has been with the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche, Ancona, as an Assistant
Professor. His current research interests include statistical modeling and
simulation of integrated circuits, mixed-signal and RF circuit design, neural
networks, and areas of signal processing involving coding, synthesis, and
automatic recognition of speech.

Claudio Turchetti (M'86) received the Laurea degree in electronics engineering from the Università degli Studi di Ancona, Ancona, Italy, in 1979.
He joined the Department of Electronics, Università degli Studi di Ancona in 1980. He is currently a Full Professor of applied electronics and integrated circuits design and the Head of the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche,
Ancona. He has been active in the areas of device
modeling, circuits simulation at the device level,
and design of integrated circuits. His current research interests are also in
analog neural networks, statistical analysis of integrated circuits for parametric
yield optimization, statistical modeling, coding, and synthesis of speech, and
automatic speech recognition.
