University of Miami
An Adaptive Time-Frequency Representation with
Re-Synthesis Using Spectral Interpolation
By
Abhijeet Tambe
A Research Project
Submitted to the faculty of the University of Miami in partial fulfillment of the
requirements for the degree of Master of Science in Music Engineering
Coral Gables, Florida
April, 2000
Abhijeet Tambe (Master of Science, Music Engineering Technology)
An Adaptive Time-Frequency Representation with Re-Synthesis Using Spectral
Interpolation
Abstract of Master's Research Project at the University of Miami
Research project supervised by William Pirkle
Abstract: The conventional method for storing and transmitting digital audio is with
PCM samples in the time domain. An adaptive time-frequency analysis is performed on
the audio signal, and the result is stored in the form of a 3-D matrix containing time, frequency
and magnitude information. The validity of this method is tested by using a spectral
interpolation algorithm to perform re-synthesis in order to compare the synthesized
signal to the original. Among the many applications for such a representation, the
possibility of audio compression is explored. The results of applying this procedure on a
few audio signals are plotted along with the results of listening tests.
Acknowledgements
I would like to thank the members of my thesis committee for their encouragement and support.
Will Pirkle in particular, who was my main advisor, was patient with my repeated changes of
focus in this study and always had great ideas to offer. Thank you all for everything.
I would like to acknowledge and thank my family members, who were with me every step of the
way. Ashwini, thanks for the midday phone calls, and Baba, thanks for keeping the faith. Shankar,
thanks machcha. Mama, Mami, Azoba, Aparna, Atul, Kaka, Kaku, Atya, Shruti, Ganesh and of
course Vaini always encouraged my scholarly efforts. Thank you all for the support.
The Dreggs are the next group of people to thank on my list. Ajay, your wisdom is beyond your
years. Archana, your dissertation was the guiding light for this work. Jatin, you are The One. Ex-
commander Vishal, I hope your commandership is restored to you someday. Ambu, your emails
are the best. Srinath's psychotic work schedule was a constant reminder of how my life is o.k.
Santosh, I hope you find what you are looking for. Seetha, I wish you happiness. Bansi, I hope
you're sending it. Raj, I don't know what to say.
Last and definitely not least are my friends in Miami. Kimbo, I won't forget your contribution to
society and Lausims in general. Ralph, we did it man, we finished our theses! Ryan, you are the
best, homeboy. Mathilde, I'm freezing. Abby, I'm hooooooome! Nele, I hope you swim with the
dolphins. Ashwanth, you have to SENNNND IT. Siddarth, you shall execute the perfect cover
drive. Karthik, you are also the best, homeboy. Alex, you're crazy man, don't change. Ali, thanks
for everything. Mike and Amy, you guys are so cool. James, thanks for introducing me to club
bridge. Jay, you are my mentor for life. Daisuke, you shall design the ultimate loudspeaker.
Margarita, thanks for introducing me to Colombian food. Anna Maria, thanks for attempting to
teach me how to dance Salsa. Anu, thanks for the monitor. Sylvia, may you be Dr. Sylvia very
soon. Jasmin, you already know everything, I don't know what to say. Thanks Paul Griffith for
introducing me to concert recording. Thanks Kai for doing the electronic parts in that cool song
we recorded. Chris, you're a great drummer, I hope you shine. Naseer, thanks for the Isuzu
Trooper. Lorna and Shayne, I'm glad I met you. Thank you Titanic for some memorable
Wednesday nights. Peggy, you are the best. Zammy Zammy, Benny-looking-maan. Thanks all of
you, I hope we stay in touch.
Table of Contents
Chapter 1 Introduction
Chapter 2 Analysis/Synthesis Techniques
2.1 Digital signals and the sampling theorem
2.2 Frequency domain transformation
2.3 Time/Frequency analysis
2.4 Filter banks
2.5 Synthesis techniques
2.6 Classical and modern studies of timbre
2.7 The Proposed Scheme
Chapter 3 Psychoacoustics
3.1 Auditory analysis
3.2 Frequency resolution of the human ear
3.3 Masking
3.4 Critical bands
Chapter 4 Non-Orthogonal Signal Decomposition
4.1 Theory and Computation
4.2 Summary
Chapter 5 Adaptive Time/Frequency Analysis
5.1 Why Adaptation?
5.2 Adaptation
5.3 Improved Adaptation Algorithm
5.4 Time/Frequency Representation
Chapter 6 Synthesis Using Spectral Interpolation
6.1 Why Spectral Interpolation?
6.2 Frame to Frame Peak Matching
6.3 Synthesis Using Cubic Phase Interpolation
Chapter 7 Results and Conclusions
7.1 Results for Classical Music
7.2 Results for Guitar Chord
7.3 Results for Clarinet Patch
7.4 Results for Speech
7.5 Listening Tests
7.6 Conclusions
7.7 Improvements
7.8 Summary
Chapter 1 - Introduction
Digital audio data is generally stored in the form of digital samples in time. These
samples are obtained by sampling the continuous analog signal f(t) at equal time
intervals T, to obtain the digitized version of the signal, f(nT). The digital signal
is stored as a sequence of samples in time and can be considered a discrete-time signal.
The signal stored is a function of time, and is said to be in the time domain. It is also
possible to view the signal in some other domain by applying an appropriate transform to
it. As an example, the Fourier transform, applied to a function of time, results in a
function of frequency, thereby enabling a frequency domain view of the signal. In general,
if the transform is chosen properly the representation of the signal in the other domain is
complete and no information is lost. The advantage of representing the signal in some
other domain is that it reveals other information inherent in the signal that might not be
obvious in the time domain.
For viewing the signal in some other domain, an analysis has to be performed on the
signal to extract the critical components required for this alternate representation. After
this is done, the necessary signal processing functions are performed on this transformed
signal. In order to convert the signal back into the time domain, a re-synthesis has to be
performed. This process is generally an inverse of the analysis procedure. This sequence
is shown in figure 1.1.
[Figure: f(n) -> Analysis -> F(m) -> Signal Processing functions -> F(m) -> Synthesis -> f(n)]
Figure 1.1 General analysis/synthesis model for audio signals
Examples of this analysis/synthesis method include compression techniques such as
MPEG (Moving Picture Experts Group), which achieve data reduction by finding and
discarding redundancies. With respect to figure 1.1, in the case of some MPEG layers, the
analysis block involves using filter banks and later transforming the signal into the
frequency domain. The signal processing function is the process of finding redundancies
in the frequency domain and discarding them to reduce data. This work is usually done
on the encoder side and at the end of this process, the signal is represented at a much-
reduced bit-rate. The synthesis involves using an inverse transform to reconstruct the
signal in the time domain. This is done by the decoder, which simply reconstructs the
signal from its coded format and plays it back.
Any audio signal consists of frequency components whose magnitudes and actual
frequencies vary dynamically with time. An alternate representation of the same signal,
which tracks these frequency components and shows how they vary with time, could be
very useful. This could be used for applications ranging from audio compression to pitch
shifting as well as time compression and expansion. The motivation for this study is to
develop such an alternate time-frequency representation for audio signals. The validity of
this method depends on whether it encodes enough information to reconstruct a
synthesized signal that is very similar in quality to the original signal. A synthesis method
is developed to reconstruct the time signal based on the new representation, so that the
original signal and the synthesized version can be compared. From an applications
standpoint, the possibilities for audio compression using this method are explored.
Broadly, this study can be classified into four sections. The first is the theory section,
which is presented in chapters 2 and 3. These chapters develop the relevant theory that is
used in later chapters. The second section deals with the analysis and time-frequency
representation of the signal, and is covered in chapters 4 and 5. The third section deals
with the re-synthesis of the time signal from this new representation and is described in
chapter 6. The fourth and final section contains the results and conclusions, which are
documented in chapter 7.
Chapter 2 Analysis/Synthesis Techniques
This chapter starts with a review of digital audio signals and the Fourier analysis method
for transforming signals into the frequency domain. Current time-frequency analysis
methods are described along with their limitations. Various synthesis techniques based on
analysis are then discussed followed by a review of classical and modern studies of
timbre, which clarifies why the proposed scheme is necessary. The chapter ends with a
brief description of the proposed scheme.
2.1 DIGITAL SIGNALS AND THE SAMPLING THEOREM
Digitizing a signal is the process of converting a continuous (analog) signal f(t) into a
discrete-time signal f(nT) by sampling the continuous signal at equal intervals of time
T. The sampling frequency F_s is the inverse of the sampling period T. According to the
sampling theorem, no information is lost in this conversion as long as the continuous
signal is band-limited and the sampling frequency F_s is at least twice the signal
bandwidth. The term band refers to a band of frequencies. The term band-limited implies
that the range of frequencies is limited, and the term bandwidth refers to the actual range
of frequencies present in the signal. If this condition is not met, an error known as
aliasing occurs. After this conversion, the signal can be represented and stored as a
sequence of sample values as shown in figure 2.1.1. The number of samples stored per
second is the same as the sampling rate. Quantization is a process where each sample is
given a discrete amplitude value and is represented by a digital word, which has a
certain number of bits. The higher the number of bits used, the lower the noise in the
signal.
Figure 2.1.1 Analog signal to digital signal conversion
Sampling rates in audio are measured in samples per second or Hertz. Common sampling
rates in audio are 8 kHz, 11.025 kHz, 16 kHz, 20 kHz, 32 kHz, 44.1 kHz, 48 kHz, 96
kHz, and 192 kHz. The audio signal must be band limited to half the sampling frequency
before sampling to prevent aliasing distortion. Some common word lengths in digital
audio are 8 bits, 16 bits, 20 bits, and 24 bits. Reducing the sampling rate or the number of
bits is equivalent to reducing the amount of data required to store the signal but also
implies that the digital representation will not be as accurate.
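
For illustration, the sampling and quantization steps can be sketched in a few lines of
Python (a minimal sketch, assuming NumPy; the sampling rate, word length, and test tone
are arbitrary choices, not values used in this study):

```python
import numpy as np

fs = 8000                               # sampling rate Fs in Hz (illustrative)
bits = 16                               # word length of each sample
T = 1.0 / fs                            # sampling period
t = np.arange(0, 0.01, T)               # sample instants nT
x = 0.8 * np.sin(2 * np.pi * 440 * t)   # f(nT): the sampled signal

# Uniform quantization: round each sample to one of 2**bits discrete levels.
levels = 2 ** (bits - 1)
x_q = np.round(x * levels) / levels

# More bits give a smaller worst-case quantization error (a lower noise floor).
print("max quantization error:", np.max(np.abs(x - x_q)))
```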
2.2 FREQUENCY DOMAIN TRANSFORMATION
In order to analyze the behavior of a signal, it is often necessary to view it in the
frequency domain. The most well known frequency domain transformation is the Fourier
Transform. The Fourier transformation and its properties in a purely analog sense are
now discussed, followed by a discussion of the implications of using it in the
digital domain. The Fourier transformation is defined as:

F(jw) = \int_{-∞}^{+∞} f(t) e^{-jwt} dt        (2.2.1)
Here f(t) is the signal as a function of time in the time domain and F(jw) is the signal
as a function of complex frequency in the frequency domain. Since F(jw) is complex, it
has both magnitude and phase and these can be plotted separately. This is also known as
the spectrum of the signal since it reveals information about the frequency characteristics
of the signal. The Fourier Transform shown in equation 2.2.1 has limits between - and
+. This implies that the signal goes on for all time. In practice, this is neither possible
nor convenient for normal calculations, which involve analyzing finite length signals. In
fact, small portions of the signal during a certain frame in time are usually extracted for
analysis. This means that the signal being analyzing is time-limited and the limits of
integration can be changed (from + to - ) to the time limits of the window being
analyzed. This signal truncation causes a smearing or spreading effect in the spectrum of
the windowed signal and this is called windowing effect. If a small part of the signal is
extracted over some frame in time, the extracted portion is equivalent to multiplying the
actual signal with a rectangular window function whose value is 1 during the portion of
time corresponding to the extracted signal and zero at all other times. In general, the
signal can be multiplied with a number of different window functions and this is given by
the following equation:
f_w(t) = f(t) w(t)        (2.2.2)

F_w(jw) = F(jw) * W(jw)        (2.2.3)

Here, f(t) is the original signal, w(t) is the window function and f_w(t) is the windowed
signal. It can be shown that using a multiplication operator in the time domain is
equivalent to using a convolution operator in the frequency domain and vice versa. Using
this result, it can be seen that the spectrum of the windowed signal is equal to the
spectrum of the original signal convolved with the spectrum of the window function. This
is significant because it implies that the spectrum of the window function plays a very
important role in the spectrum of the windowed signal. In order to analyze the effects of
the term W(jw) in equation 2.2.3, consider the case of a rectangular window.
w(t) = U_1(t + T_w/2) - U_1(t - T_w/2)        (2.2.4)
U_1(t + x) is a unit step signal with a step at time t = -x, and T_w is the width of the window
in time. If it is assumed that the window is centered around t = 0,

W(jw) = Fourier{w(t)} = T_w sin(wT_w/2) / (wT_w/2)        (2.2.5)
The spectrum W(jw) of the rectangular window can be shown to be in the form of
sin(x)/x. This is otherwise known as the sinc function and is shown in figure 2.2.1a. The
largest lobe is around the center and is known as the main-lobe, and the smaller lobes on
both sides of it are known as side-lobes. There are various windows that can be used
other than the rectangular window, such as the Hamming, Hann, and Blackman
windows. Whereas the rectangular window puts an equal weight on all parts of the signal
in the window (with its rectangular shape), these other windows apply different weights
at different points in time during the window. Typically, the shape of a window is based
on some commonly known functions such as the cos or cos^2 function. The spectra of all
the window functions tend to resemble the sinc function. They basically differ in terms of
the width of the main-lobe and suppression of side-lobes.
Ideally, it is desirable to have a window whose spectrum is as close to an impulse as
possible, since convolving with an impulse results in the same original spectrum.
Unfortunately, a window with an impulse in the frequency domain is equivalent to a
window with a value 1 for all time in the time domain, which in turn is equivalent to
considering the signal for all time, which cannot be done. The moment a finite length
signal is considered for analysis, it means implicitly that the window is of finite length
and therefore its spectrum is non-impulsive. This in turn means that the true spectrum
of the signal is never seen due to the fact that the true spectrum is convolved with the
non-impulsive window spectrum. The true spectrum refers to a spectrum where each
frequency that is present in the signal is seen as an individual impulse in the frequency
domain. The convolution operation centers the window spectrum at each of these
frequencies and the result of the operation is the sum of these centered window spectra.
In the case of a rectangular window, the window spectrum is a sinc function as
demonstrated earlier. Since the window spectrum has a non-zero bandwidth, there could
be some overlap between these window spectra centered at various frequencies. This
results in addition and cancellation of adjacent spectra, thus giving rise to a distorted
version of the original spectrum.
In many applications, it is important to be able to extract the actual frequencies present in
the signal during any given time frame. For a complex audio signal with multiple
frequencies at any given time, windowing results in a smeared spectrum in which it is
difficult to extract the exact frequencies present during that time frame. In this case, these
frequencies can be found by extracting the peaks in the frequency spectrum. In order to
do this successfully it is desirable to have peaks that are as well defined as possible and
can be extracted using a peak detection algorithm. So, choosing the right window
function becomes vital.
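
As a rough illustration of such peak extraction (a sketch assuming NumPy; the threshold
and the helper name find_peaks are invented here, and the peak detection actually used in
this study is developed in chapter 5):

```python
import numpy as np

def find_peaks(mag, threshold=0.05):
    """Return indices of local maxima in a magnitude spectrum."""
    mag = np.asarray(mag)
    above = mag[1:-1] > threshold * mag.max()          # ignore tiny side lobes
    local_max = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    return np.where(above & local_max)[0] + 1

# Example: two tones placed exactly on bins 64 and 100 of a 1024-point FFT.
n = np.arange(1024)
x = np.sin(2 * np.pi * 64 * n / 1024) + 0.5 * np.sin(2 * np.pi * 100 * n / 1024)
print(find_peaks(np.abs(np.fft.rfft(x * np.hamming(1024)))))   # -> [64 100]
```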
There are some factors to be considered before choosing a window function. For a
window of length T_w seconds, the frequency separation f_z between the zero crossings of
adjacent side lobes is 1/T_w Hz for most windows. It is preferable to have the main lobe
width be as small as possible and the side lobes as suppressed as possible to approximate
an impulse spectrum. The width of the main lobe is 2 f_z for a rectangular window, which
is relatively small, but the side lobe suppression is very poor. In fact, the maximum of the
first side lobe is only 13 dB (decibels) below the maximum of the main lobe. Better side
lobe suppression can be achieved at the cost of a wider main lobe by using other types of
windows. For example, a Hamming window has side lobe suppression of 43dB and more,
but the main lobe is twice as wide as that of a rectangular window. Choosing a suitable
window becomes a trade off between the width of the main lobe and suppression of the
side lobes and depends on the application. In general, the Hamming window tends to be
popular due to its high side lobe suppression and relatively narrow main lobe.
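
The main-lobe/side-lobe trade-off described above can be verified numerically with a
short sketch (assuming NumPy; the window length and FFT size are arbitrary choices):

```python
import numpy as np

N = 256
windows = {"rectangular": np.ones(N), "hamming": np.hamming(N)}

for name, w in windows.items():
    # Zero-padding the FFT samples the window's DTFT finely.
    W = np.abs(np.fft.rfft(w, 8192))
    db = 20 * np.log10(W / W.max() + 1e-12)
    # Walk down the main lobe to its first null...
    k = 1
    while db[k] < db[k - 1]:
        k += 1
    # ...then the largest remaining value is the peak side lobe level.
    print(f"{name}: peak side lobe {db[k:].max():.1f} dB")
    # Expected: roughly -13 dB (rectangular) vs. roughly -43 dB (Hamming).
```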
Consider what happens when the Fourier transform is used on a digitized signal. A digital
signal consists of a sequence of discrete values, which result from sampling the signal as
discussed in chapter 2.1. Equation 2.2.1 refers to a continuous signal. In order to apply a
similar transform on a discrete signal f(nT):

F(jw) = \sum_{n=-N/2}^{N/2-1} f(nT) e^{-jwnT}        (2.2.6)
Equation 2.2.6 is used for decomposing an N point signal into its frequency
components. Since it is complex, it provides magnitude as well as phase information at
each frequency. It is seen from equation 2.2.6 (and equation 2.2.1 for that matter) that
the Fourier transform is basically a correlation function which correlates the signal
f(nT) (or f(t)) with a sinusoid of a given frequency w by multiplying the two signals
and finding the total energy in the output signal. When this is done for a set of
frequencies, the F(jw) corresponding to each of those frequencies can be found using
equation 2.2.6 by substituting each of those frequencies for w and finding the result. If
an infinite number of frequencies are used to decompose a time-limited signal, the
Discrete Time Fourier Transform (DTFT) is obtained, which is a continuous function as
shown in figure 2.2.1.
In the digital domain, there are no continuous functions. Only discrete samples can be
stored. So, this spectrum (DTFT) needs to be sampled at certain frequency values. The
question is how often this spectrum needs to be sampled, so that the signal can be fully
represented. It can be shown that for an N point signal, this spectrum needs to be sampled
only at N equally spaced frequencies between 0 Hz and F_s Hz to fully represent the
signal and satisfy the condition known as perfect reconstruction (Rabiner & Schafer,
1978). When N equally spaced frequencies between 0 and F_s are used to decompose an
N point signal, the Discrete Fourier Transform (DFT) is obtained, which is the sampled
version of the DTFT. The frequencies at which the spectrum is sampled are known as
frequency bins. Any time limited signal that is a pure sinusoid has a spectrum that is of a
sinc function nature, with its main lobe centered at its actual frequency. If an N-point
DFT analysis is performed on this signal, the frequency bins are fixed at F_s n/N,
n = 0, ..., N-1. Varying spectra from the DFT analysis can be seen depending on how
close the actual frequency of the signal is to a frequency bin.
Figure 2.2.1 DTFT of a rectangular pulse
(Taken from Proakis & Manolakis, 1988)
As an example to demonstrate this, consider figure 2.2.2. All three figures show the
continuous DTFT function as well as the sampled points, which give the DFT of a pure
sine wave with a rectangular window applied on it. All three figures show the sinc
characteristics of the spectrum (DTFT). In the discrete world, where the spectrum is
calculated only for certain frequency bins, an interesting thing happens. If the frequency
of the sinusoid happens to line up exactly with a frequency bin, all the zero crossings of
the sinc function line up exactly with frequency bins. The result is that significant energy
is seen at the bin corresponding to the frequency of the signal and all other bins appear to
have zero energy. This is shown in figure 2.2.2.a. On the other hand, if the frequency
does not line up with a bin, the energy seen in the other bins is as shown in figure 2.2.2.b.
This phenomenon is known as spectral spreading and figure 2.2.2.c shows the worst case
of spectral spreading where the frequency of the sine wave is exactly halfway between
two bins.
Though in all three cases the discrete spectrum is accurate in the sense that the condition
for perfect reconstruction is met, for the purposes of analysis, the case shown in figure
2.2.2a is the most useful since it appears as though there is only one frequency present in
the signal. Additionally, the peak frequency bin in this case corresponds exactly to the
frequency of the signal. The case shown in figure 2.2.2c can lead to confusion during
analysis since it appears that there are other frequencies present in the spectrum. Besides
this, the frequency bin corresponding to the highest energy does not correspond exactly to
the actual frequency of the signal because the actual frequency is somewhere between
two adjacent bins.
Figure 2.2.2(a) DFT of windowed signal with frequency centered at a bin, (b) DFT of
windowed signal with frequency not on a bin and (c) DFT of windowed signal with
frequency halfway between 2 bins
(Taken from Lindquist, 1989)
The issue of breaking up a signal into its sinusoidal components using the DFT has been
discussed above. In this case, the basis functions (which are the types of components that
the signal is being broken up into) are sinusoidal. There are other discrete transforms that
can be used, which have a different set of basis functions (e.g., square waves), and break
up the signal into a different set of fundamental components (e.g., square waves). In
general, discrete transforms have the form
F(m) = \sum_{n=1}^{N} K(m, n) f(n),    m = 1, ..., M        (2.2.7)

f(n) = \sum_{m=1}^{M} K^{-1}(n, m) F(m),    n = 1, ..., N

(Taken from Lindquist, 1989)
Here, K(m, n) is known as the kernel and K^{-1}(n, m) is known as the inverse kernel or the
set of basis functions which constitute the transform. If the kernel set is complete (for
perfect reconstruction), any f(n) can be expressed as the linear sum of the weighted
basis functions. This is shown in matrix form in equation 2.2.8. Here, the K and J
matrices are inverses. Equation 2.2.7 and equation 2.2.8 are equivalent.

[Equation 2.2.8: the transform pair of equation 2.2.7 written as matrix products; original matrix figure not reproduced]        (2.2.8)
In the case of the DFT, the kernel set is complete if an N point signal is decomposed with
a set of at least N frequencies equally spaced between DC and F_s. The DFT can
otherwise be expressed as:
F(m) = \sum_{n=-N/2}^{N/2-1} f(n) e^{-j2πnm/N} = \sum_{n=-N/2}^{N/2-1} f(n) W_N^{nm}        (2.2.9)

where W_N = e^{-j2π/N}.
This can be shown in matrix form in equation 2.2.10. In equation 2.2.10, only the power
or exponent nm of W_N^{nm} appears in the transformation matrix or the kernel.

[Equation 2.2.10: the DFT kernel written as an N x N matrix of the exponents nm; original matrix figure not reproduced]        (2.2.10)
It can be seen that each row of the transformation matrix corresponds to a particular
frequency. Also, to decompose an N point signal into N frequencies, N^2 complex
multiplies are required. Because of certain symmetry that the DFT matrix exhibits when
N equally spaced frequencies are chosen, the resulting transformation matrix has many
redundancies, and by using decimation techniques the number of complex multiplies can
be reduced to N log_2 N. This implementation is known as the Fast Fourier Transform
(FFT) and is usually the method used to calculate the DFT of a signal in commonly
available software. The FFT can be calculated relatively quickly, in real time, on today's
digital signal processing hardware.
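
Equations 2.2.9 and 2.2.10 can be made concrete with a small sketch that builds the
kernel matrix explicitly and checks it against a library FFT (assuming NumPy; the
indexing runs n = 0, ..., N-1 here rather than the symmetric range used above, which
yields the same bins for a periodic signal):

```python
import numpy as np

N = 8
n = np.arange(N)
W_N = np.exp(-2j * np.pi / N)
K = W_N ** np.outer(n, n)      # kernel matrix: entry (m, n) holds W_N^{nm}

x = np.random.randn(N)
X_direct = K @ x               # direct evaluation: N^2 complex multiplies
X_fft = np.fft.fft(x)          # FFT: on the order of N log2 N multiplies
print(np.allclose(X_direct, X_fft))   # True
```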
The FFT however has the same disadvantages as the DFT. The signals used to
decompose the signal are a function of the sampling frequency as well as the number of
frequency bins desired. Without modifying these two, the frequencies are not selectable.
A simple sine wave whose frequency does not fall on one of the frequency bins will
produce a spectrum with energy spread to many frequencies. This phenomenon is known
as spectral spreading and was discussed earlier. Figure 2.2.3 illustrates this point. Figure
2.2.3a shows the 1024 point FFT of a 500Hz sinusoid sampled at 8000 Hz. Since one of
the frequency bins used in this decomposition is exactly at 500Hz, the FFT shows almost
all its energy concentrated at that one bin. This is because all the zero crossings of the
sinc function in the spectrum line up exactly with frequency bins. On the other hand,
figure 2.2.3b shows the FFT of a sine wave with frequency 503.9 Hz. This frequency
happens to be exactly halfway between two frequency bins (500Hz and 507.8125 Hz)
with the result that the FFT shows significant energy present in many other bins. In a
strict sense, this is true because if sinusoids with the magnitudes and phases as indicated
by the FFT are generated and added up, the original signal is again obtained. But it would
be much more useful to know that there is a single frequency component at 503.9 Hz
rather than many frequency components at different bins as indicated by the FFT.
Figure 2.2.3 (a) FFT of 500 Hz and (b) FFT of 503.9 Hz
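
The experiment behind figure 2.2.3 is easy to reproduce (a sketch assuming NumPy;
8000 Hz and 1024 points match the figure, so the bin spacing is 8000/1024 = 7.8125 Hz):

```python
import numpy as np

fs, N = 8000, 1024
n = np.arange(N)
for f0 in (500.0, 503.90625):          # on bin 64, then halfway between bins
    x = np.sin(2 * np.pi * f0 * n / fs)
    X = np.abs(np.fft.rfft(x))
    peak = int(np.argmax(X))
    spread = int(np.sum(X > 0.01 * X.max()))   # bins holding visible energy
    print(f"{f0} Hz: peak bin {peak} ({peak * fs / N:.4f} Hz), "
          f"{spread} bins above 1% of peak")

# The on-bin tone concentrates in a single bin; the off-bin tone spreads its
# energy across many bins, exactly as in figures 2.2.3(a) and (b).
```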
Also, the FFT produces good results only for signals that tend to stay stationary. For
signals that vary rapidly with time, the FFT is not able to show the finer details of what is
happening within the analysis time frame. Figure 2.2.4a shows a typical audio signal in
the time domain and figure 2.2.4b shows its FFT. Although the FFT accurately shows
that most of the energy in the signal is present at lower frequencies, there are many finer
details in the time domain which it is unable to represent. The FFT only represents the
average behavior of each frequency in this time frame or window. For this reason, it is
common to study an audio signal using a series of short windows in time, and performing
Fourier analysis on each of them. This is known as a Short Term Fourier Transform
(STFT).
Figure 2.2.4 (a) Typical Audio signal in time and (b) its FFT
The difference between using a long window and a short window is a tradeoff between
good spectral resolution and good time resolution. This has been clearly explained by the
Uncertainty Principle (Cohen, 1995), which states the well-known mathematical fact that
a narrow waveform yields a wide spectrum and a long waveform yields a narrow
spectrum. Ackroyd states, "if the effective bandwidth of a signal is W then the effective
duration cannot be less than about 1/W (and conversely)." In fact, the time-bandwidth
product is a constant.
What this means in signal processing terms is that the smaller the duration of the window
used to analyze the signal, the larger the effective bandwidth of the section of the signal
under analysis. As an example, if a windowed portion of a pure sinusoid is under
analysis, it was shown earlier that its spectrum has a sinc function nature with the main
lobe centered at the frequency of the sinusoid. The widths of the main lobe and the side
lobes are inversely proportional to the length of the window. The longer the analysis
window in time, the narrower the lobes become and conversely, the shorter the analysis
window, the wider the lobes become. This is significant especially with respect to audio
signals, which have multiple frequencies present in a given time frame. The longer the
analysis window, the narrower the lobes for each separate frequency become thereby
decreasing the interaction between adjacent frequency peaks. So, with long analysis
windows, good spectral resolution is obtained with well-defined frequency peaks, but the
time-resolution is poor since the spectrum tends to reflect the averaged value over a long
period of time for each frequency. When short analysis windows are used to increase
time-resolution, the spectral resolution is decreased due to the larger lobe width leading
to increased interaction. In speech and audio processing, it is common to use window
lengths between 5ms and 30ms. The choice of the type of window used depends on how
important side lobe suppression is compared to narrow main lobe width, since this is
where the trade off lies for different types of windows.
2.3 TIME-FREQUENCY ANALYSIS
Audio signals typically vary with both time and frequency. In order to accurately assess
what the frequency components of the signal are, and also how they vary with time, it is
necessary to perform an analysis that shows a time-frequency distribution.
As seen in the previous section, the FFT of a relatively long signal in time tends to
average out the effects of various frequencies within that block. In other words, very little
information is obtained about the effect of frequency components that are changing
rapidly within that block of time. In order to reduce this effect the signal is divided into
smaller blocks in time and each of these is analyzed separately. The idea is to reduce the
size of the block to a small enough length in time such that frequency information within
that block is almost stationary. Once each block is analyzed in this way, the Fourier
transformations of each block in time can be lined up to get a better representation of the
time-frequency behavior of the signal. To demonstrate the advantage of using this
method, consider figure 2.3.1. Figure 2.3.1a shows a chirp signal, which is a sinusoid
whose frequency increases linearly in time. If a Fourier transform is performed on this
signal, the result is the spectrum shown in figure 2.3.1b.
Figure 2.3.1(a) Chirp Signal in Time and (b) its FFT
This shows a near flat spectrum and is the expected behavior for the Fourier transform
since it indicates that all frequencies have equal energy content within the signal. But it
does not indicate that the signal has a frequency increasing linearly in time. The spectrum
is actually similar to that of white noise as well as that of an impulse. In order to
understand how the analyzed signal is different from white noise or an impulse a time
frequency distribution as shown in figure 2.3.2 is desirable.
The distribution shown in figure 2.3.2 shows frequency plotted as a function of time, with
intensity represented in gray scale. This figure shows the ideal time-frequency
distribution of the chirp signal because it shows frequency increasing linearly with time
with equal intensity at all times. It is not possible to obtain such an ideal time-frequency
distribution using current analysis techniques. This is because the time-frequency
distribution shown above implies perfect resolution in both time and frequency and this
violates the uncertainty principle (Cohen, 1995). When using Short Time Fourier
Transforms to calculate the time-frequency distribution of a signal, it is common to use a
series of windows to analyze the signal. The choice of how short or long each window is,
the type of window, and the amount of overlap used depends on the type of signal
analyzed (e.g., voice, music etc.), the application (e.g., real time, non-real time etc.), and
the computational power available (e.g., memory, MIPS etc).
Figure 2.3.2 Contour plot for the ideal time-frequency distribution of a chirp signal
There are many existing time-frequency distributions. Although it is not possible to
obtain the perfect time-frequency distribution in figure 2.3.2, these distributions approach
the ideal case. The spectrogram is the most basic of time-frequency distributions. It is
calculated by performing Short Term Fourier Transforms on consecutive windows and
combining the results to give a plot similar to the one in figure 2.3.3, which shows the
spectrogram of the chirp signal discussed earlier. The horizontal axis is time, the vertical
axis is frequency, and the magnitude is shown in grayscale. It is seen that it provides a much
better analysis of the chirp signal than the FFT performed over its entire length (shown in
figure 2.3.1b). The spectrogram accurately reveals that the frequency of the signal
increases linearly with time, but it also shows a lot of spectral spreading at any given
point in time, due to windowing effects discussed in the previous section.
There are other methods such as the Wigner distribution, the Choi-Williams distribution
and wavelet analysis, which can be used to produce different types of time-frequency
distributions. These are not discussed since they are not relevant to this study.
Figure 2.3.3 Spectrogram of Chirp Signal
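
A spectrogram like the one in figure 2.3.3 can be computed with a short sketch (assuming
NumPy and SciPy; the window length and overlap below are illustrative choices, not the
parameters used to produce the figure):

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
x = signal.chirp(t, f0=0, t1=2.0, f1=3000)   # frequency rises linearly

# Shorter windows trade frequency resolution for time resolution
# (the uncertainty trade-off discussed in section 2.2).
f, frames, Sxx = signal.spectrogram(x, fs=fs, window="hamming",
                                    nperseg=256, noverlap=128)
print(Sxx.shape)   # (frequency bins, time frames), squared magnitudes

# Plotting Sxx in grayscale against (frames, f) shows the rising ridge.
```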
2.4 FILTER BANKS
Filter banks are used to break up a signal into the desired number of frequency bands. By
doing this, the flexibility to adjust certain parameters used in analysis is retained,
depending on which band is being analyzed. MPEG algorithms use filter banks to split
the signal into bands that resemble human ear critical bands (chapter 3.3) and then apply
a psychoacoustic model to each band, to detect and discard redundant data.
Filter banks are not systems that simply consist of a bank of simple band-pass filters that
break up the signal into the required bands. If this were the case, redundancy would be
added to the data. As an example, consider an N point signal sampled at F_s, which is to
be broken up into 10 bands of equal width. If a bank of simple band-pass filters is used to
do the job, the result is 10 N point signals, each being sampled at F_s. This would mean
that there were a total of 10N points but no more information than before. This is
because each band uses only 1/10th of the total spectral bandwidth that is available to
represent it. Consider the lowest frequency band, which is being highly over-sampled at
F_s. It only has information between 0 and π/10. According to the Nyquist theorem, it
can easily be sampled at a rate of F_s/10 without any aliasing distortion, thereby reducing
data by a factor of 10 for that band. Similarly, all other bands can also be down-sampled
by a factor of 10 without losing any information, thereby resulting in a total of N points,
which is what was started with. How this is achieved using filter banks is now explained.
Filter banks are multirate systems that not only divide the signal into the required number
of frequency bands, but also change the sampling rate of each band depending on its
frequency bandwidth (Vaidyanathan, 1993). Filter banks use decimation (or down-
sampling) during analysis and expansion (or up-sampling or interpolation) during re-
synthesis. An M-fold decimator reduces the sampling rate by a factor of M and an L-fold
expander increases the sampling rate by a factor of L. These are shown in figures 2.4.1(a)
and (b).
Figure 2.4.1 (a) M-fold decimator (b) L-fold expander
(Taken from Vaidyanathan, 1993)
In this study, the filter bank is only used for analysis and not reconstruction, so the main
concern is decimation. The M-fold decimator simply outputs every Mth sample of the
signal and discards all samples in between. This is shown in figure 2.4.2 and is equivalent
to reducing the sampling rate by a factor of M. Mathematically, this is represented as
shown in equation 2.4.1.
Figure 2.4.2 Demonstration of decimation for M=2
(Taken from Vaidyanathan, 1993)
y_D(n) = x(Mn)        (2.4.1)
The decimator not only changes the sampling rate of the signal but also the spectrum. The
relation between the spectrum of the decimated signal y_D(n) and the original signal x(n)
is as follows:

Y_D(e^{jw}) = (1/M) \sum_{k=0}^{M-1} X(e^{j(w - 2πk)/M})        (2.4.2)
(Taken from Vaidyanathan, 1993)
This can be graphically interpreted as follows: (a) stretch X(e^{jw}) by a factor M to obtain
X(e^{jw/M}), (b) create M-1 copies of this stretched version by shifting it uniformly in
successive amounts of 2π, and (c) add all these shifted and stretched versions to the
unshifted version X(e^{jw/M}), and divide by M. This is shown in figure 2.4.3.
Figure 2.4.3 Demonstrating the frequency domain effect of decimation with M=3
(Taken from Vaidyanathan, 1993)
This means that a signal can be down-sampled by any required factor. But certain
conditions have to be fulfilled to avoid aliasing effects, which occur due to this stretching
and adding of spectra. Consider the case where a signal of sampling rate F_s is low-pass
filtered and high-pass filtered at π/2 to yield two signals, both of which retain the original
sampling rate of F_s. The effects of down-sampling both signals by a factor of M=2 are
now analyzed. Down-sampling by a factor of 2 is equivalent to reducing the sampling
rate by a factor of 2. The low frequency band signal must have a band limit at π/2. This
is because its spectrum gets stretched by a factor of 2 and added to shifted versions of
itself centered at multiples of 2π. If the original low-passed signal has a bandwidth of
more than π/2, the new signal will stretch to beyond π and get added to the shifted
versions, thereby resulting in aliasing. Even the high-passed signal can be down-sampled
by a factor of 2 as long as its lower frequency limit exceeds π/2. Again in this case, its
spectrum gets stretched and added to shifted versions of itself. As long as the original
signal is band-limited to between π/2 and π, no aliasing occurs. All that happens is that
its spectrum in the high frequency region between π/2 and π now gets mirrored between
0 and π/2 and stretched so that it is between 0 and π in the new signal. As an example,
if a chirp signal rising from 2 kHz to 4 kHz, originally sampled at 8 kHz, is down-sampled
by a factor of 2, the result is a chirp signal with frequency falling from 2 kHz to DC.
There is no aliasing or information loss, but the resulting signal does not sound at all like
the original. The original signal can be recovered though, by using a system for
conversion, which is discussed in later chapters.
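
The chirp example just described can be checked directly (a sketch assuming NumPy and
SciPy; no anti-aliasing filter is needed before decimation because the signal occupies
only the upper half band π/2 to π):

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
x = signal.chirp(t, f0=2000, t1=1.0, f1=4000)   # occupies pi/2..pi

y = x[::2]                         # 2-fold decimator: keep every 2nd sample

# Estimate the instantaneous frequency of y from its analytic signal.
phase = np.unwrap(np.angle(signal.hilbert(y)))
inst_f = np.diff(phase) * (fs / 2) / (2 * np.pi)
print(inst_f[200], inst_f[-200])   # near 2 kHz early, falling toward DC
```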
Figure 2.4.4 Analysis filter bank
(Taken from Kotvis, 1997)
A typical filter bank uses the technique described above to divide the signal into the
required number of bands. Figure 2.4.4 shows an analysis filter bank implementation
where the signal is divided into 10 octaves. Assume that the signal x(n) has N points, is
sampled at F_s and is band-limited to F_s/2. Here H_1 and H_0 are complementary high-
pass and low-pass filters, both of which have cutoff frequencies at π/2. Both are also
followed by a down-sampling operator. At the first stage, the cutoff frequency of π/2
corresponds to F_s/4. The high-passed version of the signal, which represents content
between F_s/4 and F_s/2, is down-sampled by a factor of 2 to yield y_10(n). y_10(n) contains
only N/2 points after the down-sampling operation. The low-passed signal is also down-
sampled and contains frequency information between 0 and F_s/4, and has N/2 points with
an effective sampling rate of F_s/2. This signal is again filtered using the same pair of
filters as before, but in this case the cutoff frequency of π/2 corresponds to F_s/8 since
the signal has already been down-sampled once. Now the high-passed version of this
signal, containing frequency information between F_s/8 and F_s/4, is again down-sampled
to yield y_9(n), which contains only N/4 points. This process is continued till the signal is
successfully divided into 10 bands. Each band contains only half the number of points as
the previous band due to down-sampling (except the last band y_1). The total number of
points added up from all the signals y_1 to y_10 is N.
It is important to have a proper choice of the pair of filters H_1 and H_0. They need to have
narrow transition bands and cutoff frequencies adjusted to minimize aliasing effects. In
chapter 5, a similar filter bank to the one shown above is used, and the exact parameters
used in the filter bank and the filters are described.
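
The octave tree of figure 2.4.4 can be sketched as a repeated complementary half-band
split followed by 2-fold decimation (assuming NumPy and SciPy; the 8th-order Butterworth
pair below is an illustrative stand-in for H_0 and H_1, not the filter design actually used
in this study):

```python
import numpy as np
from scipy import signal

def octave_analysis(x, n_bands=10):
    b_lp, a_lp = signal.butter(8, 0.5, btype="low")    # H0: cutoff at pi/2
    b_hp, a_hp = signal.butter(8, 0.5, btype="high")   # H1: cutoff at pi/2
    bands = []
    lo = x
    for _ in range(n_bands - 1):
        bands.append(signal.lfilter(b_hp, a_hp, lo)[::2])  # top half octave
        lo = signal.lfilter(b_lp, a_lp, lo)[::2]           # rest, half rate
    bands.append(lo)               # the last band is not split further
    return bands[::-1]             # ordered y1 (lowest) ... y10 (highest)

x = np.random.randn(4096)
print([len(b) for b in octave_analysis(x)])
# [8, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]: the lengths total N
```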
2.5 SYNTHESIS TECHNIQUES
Sound synthesis is the generation of a signal that creates a desired acoustic sensation
(Dodge & Jerse, 1997). In computer music, there are various synthesis techniques that are
used for generating sounds that imitate real instruments. In fact, in computer
music, the term instrument refers to an algorithm that realizes or performs a musical
event. In this study, analysis-based synthesis techniques are used. Since the aim is to
obtain certain parameters from the analysis and then re-synthesize the music based on
these parameters, a brief review of some synthesis techniques is presented in the
following sections.
The Oscillator
The unit generator that is fundamental to almost all computer sound synthesis is called
the oscillator. An oscillator generates a periodic waveform that can be of various types
such as sinusoidal, square or saw-tooth. The controls applied to an oscillator determine
amplitude, frequency and phase of the waveform it produces. A flowchart symbol for an
oscillator with its various controls is shown in figure 2.5.1.
Figure 2.5.1 Flow chart symbol for an Oscillator
(Taken from Dodge & Jerse, 1997)
Additive Synthesis
For a certain tone, if the spectral components and their magnitudes are known, then each
of these components can be modeled using an oscillator with the appropriate frequency
and phase functions as well as an amplitude envelope. In this method, each partial of the
desired tone is represented by one oscillator with the above mentioned parameters. Thus,
adding up the output from each oscillator can generate the desired tone. This is known as
additive synthesis and is shown in figure 2.5.3. The amplitude and frequency parameters
of the real tone can be obtained by using some of the (time-frequency) techniques
mentioned earlier in this chapter. The name Fourier re-composition is sometimes used to
describe synthesis from analysis, because it can be thought of as the reconstitution of the
time varying Fourier components of the sound. Additive synthesis has proved capable of
generating tones that are virtually indistinguishable from the original tone even by trained
musicians. The only problem is that sometimes a large number of oscillators are required
to generate the given tone.
Figure 2.5.3 Basic configuration for additive synthesis
(Taken from Dodge & Jerse, 1997)
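
A minimal additive-synthesis sketch in the spirit of figure 2.5.3 (assuming NumPy; the
three partials and their envelopes are invented purely for illustration):

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)

# (frequency in Hz, peak amplitude) for each partial -- illustrative values.
partials = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)]

tone = np.zeros_like(t)
for k, (freq, amp) in enumerate(partials):
    # Give higher partials a later attack and an earlier decay, in the
    # spirit of Risset's trumpet observations (section 2.6).
    attack = t / (0.02 * (k + 1))
    decay = (1.0 - t) / (0.1 * (k + 1))
    env = amp * np.clip(np.minimum(attack, decay), 0.0, 1.0)
    tone += env * np.sin(2 * np.pi * freq * t)  # one oscillator per partial
```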
Synthesis Using Spectral Interpolation
The techniques described above can be used to recreate very natural sounding tones. The
parameters of any tone being analyzed, though, are constantly changing, and therefore the
parameters of each oscillator need to keep changing with each frame in time. If the
signal is synthesized frame by frame, based on the parameters for each frame, and then
the synthesized frames are simply arranged in the correct order, there may be
discontinuities in the synthesized signal at frame boundaries. One way to solve this
problem is to cross-fade between frames. Another possible method is to use simple linear
interpolation for each parameter from frame to frame. Linear interpolation works well for
amplitude envelopes but can create discontinuities in the signal if used directly for
frequency. A solution for the synthesis problem is presented in later chapters.
2.6 CLASSICAL AND MODERN STUDIES OF TIMBRE
The classical theory of timbre is based on the Helmholtz model. Hermann von Helmholtz
laid the foundation for the modern theory of timbre in his 19th-century work, On the
Sensations of Tone. He characterized tones as consisting of several waveforms of
different frequencies enclosed in an amplitude envelope consisting of three parts: the
attack, steady state, and decay portions. An interesting part of this study was that he
concluded that all the partials (spectral peaks) in a tone have the same attack, steady state,
and decay times. Modern studies show that this is usually not the case.
Jean-Claude Risset, in his 1966 work, Computer Study of Trumpet Tones, employed an
FFT-based algorithm to gain information about the spectral characteristics of a trumpet
tone (Dodge & Jerse, 1997). Whereas Helmholtz and other researchers before him had
applied a Fourier Transform to the steady state section of the tone, Risset applied a series
of windows on his data and performed the FFT on each window. Thus, he was able to get
more accurate time-frequency information. He used windows that were between 5 and 50
ms for his analysis, and what he found was very different from what Helmholtz had
concluded about the fundamental frequency and all its partials having the same amplitude
envelope. He found that the partials did not have the same amplitude envelope as the
fundamental and even their frequencies varied with time. Figure 2.6.1 shows the
amplitude progressions of the partials of a trumpet tone.
It can be seen from this figure that higher harmonics attack last and decay first.
Additionally, each harmonic has fluctuations in frequency during the course of the tone
(these are especially erratic during the attack), quite similar to a vibrato effect, and re-
synthesis of the tone without these fluctuations produces a discernible change in the
character of the tone. John Chowning and Michael McNabb have demonstrated the
importance of synthesizing the fluctuations in frequency of the various partials for the
output to be perceived as a fused single tone.
Figure 2.6.1 Amplitude Progressions of the Partials of a Trumpet Tone
(Taken from Dodge & Jerse, 1997)
2.7 THE PROPOSED SCHEME
The aim of this study is to develop an alternate representation for audio signals in the
form of a time-frequency matrix. Among the many applications for this discussed earlier,
the one that is explored in this study is audio compression. In most cases, audio
compression is achieved by removing irrelevant or redundant data from the music.
Usually the original music is preserved in some way, but some model is used to provide
variable bit allocation. Since the objective of this study is to form a time-frequency
matrix, the signal first has to be divided into segments of time which are so small that the
signal can be assumed to be almost stationary within that segment or frame. A frequency
analysis is then performed on that frame and the frequencies and their corresponding
magnitudes present in that frame are extracted. If a similar analysis is performed on all
the frames that the signal is divided into, the spectral characteristics of each frame in time
can be obtained. This information can be arranged in two matrices of the same size, both
of which have each column representing a frame in time. The first is the frequency
matrix and the second is the magnitude matrix. Each column of the frequency matrix
represents a frame and contains the frequencies that were found in that frame. The
elements of the magnitude matrix are the magnitudes of the corresponding elements
(frequencies) in the frequency matrix. The nth column of the first matrix contains the
frequencies of the nth frame. Together, the two matrices form a sort of a 3-D time-
frequency representation of the signal, which contains a greatly reduced amount of data
compared to the signal stored in the form of samples. The signal can be later re-
synthesized using the data in these matrices.
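
The layout of the two matrices can be sketched as follows (assuming NumPy; the per-frame
peak lists are invented data, and frames with fewer peaks are zero-padded so both
matrices stay rectangular):

```python
import numpy as np

frames = [                      # per-frame analysis output: (frequency, magnitude)
    [(440.0, 0.9), (880.0, 0.4)],
    [(442.0, 0.8), (884.0, 0.5), (1320.0, 0.1)],
    [(445.0, 0.7)],
]

rows = max(len(f) for f in frames)
freqs = np.zeros((rows, len(frames)))   # column n: frequencies of frame n
mags = np.zeros((rows, len(frames)))    # same positions: their magnitudes
for n, peaks in enumerate(frames):
    for i, (f, m) in enumerate(peaks):
        freqs[i, n], mags[i, n] = f, m

# Together the two matrices act as a 3-D (time, frequency, magnitude)
# representation, far more compact than the raw sample stream.
print(freqs[:, 1])   # the peak frequencies found in frame 1
```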
For reasons that are described more specifically later, non-orthogonal signal
decomposition is used for performing the spectral analysis on each frame and this is
described in chapter 4. This method is only the basis for developing an adaptive
algorithm that detects the frequency peaks and their magnitudes in each frame and stores
them in the form of a time-frequency matrix as described above. This algorithm is
discussed in chapter 5. The method is flexible enough to allow for a psychoacoustic
model to be used. Some fundamental psychoacoustic phenomena are described in chapter
3. The second part of this study comprises re-synthesizing the music using these
parameters, which have been extracted. Since conventional methods prove to be
inadequate, a frequency interpolation algorithm based on a cubic equation is used for this
re-synthesis. This is described in chapter 6. Chapter 7 consists of the results of this
experiment along with the conclusions.
Chapter 3 Psychoacoustics
Introduction
Psychoacoustics is a branch of psychophysics, which deals with the relationship between
acoustic stimuli and auditory sensation. It addresses the question of why we hear what we
hear when we are exposed to a given acoustic stimulus. There are certain psychoacoustic
phenomena that occur when human beings hear sound that have been studied intensively
and are the basis of the increasingly popular perceptual coders. Most of these coders
have psychoacoustic models that analyze the signal and detect those parts of it that
psychoacoustic research over the years has shown the human ear cannot detect.
These coders then either discard this irrelevant data or code it at a very low bit rate,
thereby achieving audio compression. This chapter starts with a review of auditory
analysis in general and then presents some relevant psychoacoustic phenomena.
3.1 AUDITORY ANALYSIS
A comprehensive theory of auditory analysis did not exist before Helmholtz's work. The
second chapter of Sensations of Tone (Helmholtz, 1863) contains a summary of what
Helmholtz considered to be the main problems of auditory analysis. He pointed out that
people have no difficulty in following the individual instruments at concerts or directing
their attention at will to the words of a speaker. It follows that different trains of sound
can be propagated without mutual disturbance and that the ear can break down a complex
sound into its constituent elements. For an explanation of how this is done, he borrowed
from Ohm's law of hearing, which states that the ear separates a complex sound into its
sinusoidal components similar to those in mathematical analysis. Helmholtz made the
important addition that other components such as difference tones, which are not
physically present in the stimulus, are the result of nonlinearity in the ear. He added that
the sensation of sound was due to the stimulation of nerves in the ear and every
discriminable pitch corresponded to a particular nerve or set of nerves.
Modern day research shows that much of what Helmholtz had assumed was true, but
there are inconsistencies that are yet to be resolved. The ear is divided into the outer ear,
the middle ear and the inner ear. The outer ear consists of the pinna and the external
canal; the middle ear contains the ossicle chain, which couples the eardrum to the entrance
(oval window) of the inner ear. The inner ear consists of the spiraling, marble-sized cochlea, whose
boundaries form the basilar membrane. Vibrations on the basilar membrane are picked up
by hair cells, which form a lining on it and are then transmitted to neurons, which pass
the signal to the brain. The basilar membrane has a stiffness coefficient that varies from
very high near the entrance of the inner ear to very low near the end. This results in
regions of resonance depending on the stiffness, which are stimulated by their respective
resonant frequencies. Thus, depending on the frequencies present in an audio signal,
different regions of the basilar membrane may be stimulated, each giving rise to the
sensation of a particular pitch. These facts are well known now and can explain how the
sensation of the pitch of a pure tone occurs. However they do not satisfactorily explain
phenomena such as difference tones.
3.2 FREQUENCY RESOLUTION OF THE HUMAN EAR
There is a natural limit to an individual's ability to establish a relative order of pitch when
two tones (of same intensity) are presented one after another. When the difference in
frequency between the two tones is very small, both tones are judged as having the same
pitch. This difference limen is known as the just noticeable difference (JND) in
frequency (Roederer, 1995). If the variation between the two tones exceeds the JND, a
change of pitch is detected. The degree of sensitivity to pitch changes or frequency
resolution capability depends on the frequency, intensity and duration of the tone in
question and on the suddenness of frequency change. Figure 3.2.1 shows the JND in
frequency plotted against the frequency of the pure tone in question for a typical human
subject, when the pure tone is varied slowly.
Figure 3.2.1 JND in frequency of a pure tone
(Taken from Roederer, 1995)
It is interesting to note that below 500Hz, the resolution is constant at around 3 Hz and
starts rising only at higher frequencies. The dotted lines indicate the JND in frequency as
a percentage of the frequency of the tone in question, and it can be seen that in this
respect the JND is large at low frequencies. This is the reason why bass guitarists prefer to
tune to the harmonics on their bass guitars, because the frequency of the fundamental
note is too low for good frequency resolution in the ear.
3.3 MASKING
The phenomenon of masking, which occurs in human hearing, is commonly used by
perceptual coders such as MPEG. Masking is the process by which one sound at a lower
sound pressure level (known as the maskee), is rendered inaudible to the human ear due
to the presence of another sound at a higher sound pressure level (known as the masker),
which is presented either simultaneously or offset by a small time difference.
Extensive studies have been done to find out masking thresholds and to find out how the
frequency difference between the two sounds affects the masking thresholds. The
phenomenon of simultaneous masking is discussed in this chapter since it is relevant to
the study.
Figure 3.3.1 Curves of equal loudness
(Taken from Roederer, 1995)
It is interesting to note that human beings are generally more sensitive to certain
frequency ranges than other frequency ranges. Tones of equal sound pressure level (SPL)
but different frequencies are judged as having different loudnesses. Thus SPL is not a
good measure of loudness when comparing tones of different frequencies. Experiments
have been performed to establish curves of equal loudness, taking the SPL at 1000Hz as
the reference quantity. These are shown in figure 3.3.1 and it can be seen that except at
very high sound pressure levels, human beings are typically more sensitive to frequencies
in the 1 kHz - 4 kHz range.
Figure 3.3.2 Masking curves for a 415 Hz masker at different levels.
(Taken from Roederer, 1995)
When presented with a pure tone at a given SPL, there is a certain minimum change in
SPL that is required to give rise to a change in loudness sensation. This is known as the
just noticeable difference (JND) in sound level and is roughly constant on the order of
0.2-0.4 dB for the musically relevant range of pitch and loudness. Equivalently, it is the
minimum intensity that a second tone of the same frequency and phase must have, to be
noticed in the presence of the first tone, whose intensity is kept constant. This minimum
intensity is known as the threshold of masking. This is for the case of two tones of equal
frequency. Masking also takes place when two tones of different frequencies are
presented together. The masking level is determined as the minimum intensity level that
the masked tone must exceed in order for it to be singled out and heard in the presence
of a masking tone. This masking threshold depends heavily on the frequency difference
between the two tones. It has been found that masking effects are more predominant
(masking threshold is high) when frequency separation between the masker and maskee
is small. Also, in general, lower frequency tones mask higher frequency tones more
effectively. Figure 3.3.2 shows the masking curves for a 415 Hz masker. These are
basically plots of masking thresholds vs frequency for a maskee in the presence of a
given masker. Here each curve represents the thresholds given a 415 Hz masker
presented at the loudness level indicated within each curve.
The masking thresholds are quite high for maskees whose frequencies are in the vicinity
of the masker. It can be concluded that masking is effective only in a particular band of
frequencies around the frequency of the masker and masking effects for tones that are far
away in frequency can be ignored except at extremely high sound pressure levels. So far,
the phenomenon of masking has been described without explaining why it occurs. In this
respect, the critical band concept is commonly used as the explanation and is the topic of
the next section.
3.4 CRITICAL BANDS
Fletcher (1940) proposed the critical band concept to account for many of the phenomena
of masking. He suggested that different frequencies produce their maximal effects at
different locations along the basilar membrane, and that each of these locations responds
to a limited range of frequencies. The range of frequencies to which a particular segment
responds is its critical band. In this respect, it is useful to view the basilar membrane as a
series of band-pass filters with a certain bandwidth corresponding to the critical band
(Tobias, 1970). When a tone whose frequency corresponds to a certain segment on the
basilar membrane is masked by wide band noise, only the frequencies of the noise that
fall within the bandwidth of that section (its critical band) are effective in masking the
tone. According to Fletcher, the tone is just detectable when its energy is equal to the
energy of the noise that affects that critical band. Fletcher says, "When the ear is
stimulated by a sound, particular nerve fibers terminating in the basilar membrane are
caused to discharge their unit loads. Such nerve fibers can then no longer be used to carry
any other message to the brain by being stimulated by any other source of sound."
Experiments have shown that when two tones whose difference in frequency is below a
certain limit are presented together, since they both stimulate the same region on the
basilar membrane, it is not possible to resolve two separate tones. Instead, a beating
sensation is heard as a result of the amplitude modulation that takes place. If the
frequencies are separated further beyond that limit but within the critical band, two tones
can be perceived but a sensation of roughness persists. It is only when the frequency
separation exceeds the critical band that the sensation of roughness disappears and both
tones sound smooth and pleasing. The critical band concept has been used to explain
many other phenomena including that of frequency-dependent loudness summation of
multiple tones, and it is one of the most useful discoveries in the field of psychoacoustics
and music theory.
Chapter 4 - Non-Orthogonal Signal Decomposition
Introduction
The first step in performing a time-frequency analysis is dividing the signal into small
frames and performing a spectral analysis on each frame. Since finite length windows are
used, the spectrum (or the DTFT) will have the smearing effect described in chapter 2.2.
In particular, for the case of a rectangular window, the DTFT of the windowed signal can
be obtained by convolving the long-term spectrum of the signal with the sinc function
corresponding to the spectrum of the window function. If the frequencies actually present
during that frame are to be extracted, they could be closely approximated by extracting
the peaks in the DTFT. To detect the frequency peaks in the DTFT, a DFT analysis
(using an FFT algorithm) could possibly be performed on the signal and peak frequency
bins in the DFT could be located. But the peak frequency bins in the DFT are not
necessarily equal to the peak frequencies in the DTFT (except for the case where the
signal contains a single frequency, which lines up exactly with a frequency bin). This is
because the bins in the DFT are already fixed regardless of the nature of the signal. The
peaks in the DTFT though, depend purely on the nature of the signal.
The aim now is to find the exact peaks in the DTFT. The DFT is not well suited to this
because of the fact that it has fixed frequency bins at equal spacing. An adaptive
algorithm is developed, which adapts to the signal and locates the exact peaks in the
DTFT. To accomplish this, a non-orthogonal signal decomposition method is used, by
which the DTFT can be sampled at any desired frequency. The following section reviews
the non-orthogonal signal decomposition method.
4.1 THEORY AND COMPUTATION
The fundamental operation in any transformation of a time domain signal into its
frequency equivalent is the decomposition of the signal into a predetermined set of
sinusoidal functions. These transformations offer an alternative description of the time
domain signal as a linear combination of a set of basic functions such as sinusoids,
exponential sinusoids and pure exponentials. The DFT decomposes the signal into a set
of frequencies, which are equally spaced between DC and the Nyquist frequency. It gives
a set of magnitude and phase coefficients corresponding to each frequency. These can be
converted to a set of a and b coefficients corresponding to each frequency bin such that
the signal can be reconstructed as the linear sum of each frequency multiplied by the
coefficient that was calculated. This is shown in equation 4.1.1. Based on DFT principles,
the signal at any time instant n, n = 1, ..., N, is given by the following sum of weighted
sines and cosines:

$$s(n) = \sum_{i=0}^{N/2} \left[ a_i \sin\!\left(\frac{2\pi i (n-1)}{N}\right) + b_i \cos\!\left(\frac{2\pi i (n-1)}{N}\right) \right], \quad n = 1, \ldots, N \qquad (4.1.1)$$
This relationship is used in the new decomposition technique. The basic theory behind
signal decomposition using non-orthogonal sinusoidal bases is discussed by Dologlou et
al., 1996. Their study demonstrates that any signal can be decomposed using virtually any
group of sinusoids. If non-orthogonal bases are used for decomposition, any set of
frequencies can be chosen to decompose the signal. In other words, the frequencies where
the DTFT of the signal is sampled can be chosen.
The computation is fairly simple. First, N frequencies that are desired for decomposition
are selected between 0 and half the sampling rate $F_s$. These frequencies are converted
into their digital equivalents between 0 and $\pi$ using $\omega_n = 2\pi f_n / F_s$, n = 1, 2, ..., N,
where $f_n$ is the frequency being converted and $\omega_n$ is its digital equivalent. Next, a
matrix of size 2Nx2N is formed using sine and cosine vectors as shown in equation 4.1.2:
$$A = \begin{bmatrix} \sin(0 \cdot \omega_1) & \cos(0 \cdot \omega_1) & \cdots & \sin(0 \cdot \omega_N) & \cos(0 \cdot \omega_N) \\ \sin(1 \cdot \omega_1) & \cos(1 \cdot \omega_1) & \cdots & \sin(1 \cdot \omega_N) & \cos(1 \cdot \omega_N) \\ \sin(2 \cdot \omega_1) & \cos(2 \cdot \omega_1) & \cdots & \sin(2 \cdot \omega_N) & \cos(2 \cdot \omega_N) \\ \vdots & \vdots & & \vdots & \vdots \\ \sin((2N-1)\,\omega_1) & \cos((2N-1)\,\omega_1) & \cdots & \sin((2N-1)\,\omega_N) & \cos((2N-1)\,\omega_N) \end{bmatrix} \qquad (4.1.2)$$
The matrix A contains alternate sine and cosine vectors in each column. Each column has
the first 2N samples in time corresponding to that sine or cosine vector. The last step is
the inversion of the matrix: $B = A^{-1}$. The matrix B decomposes any signal into the
chosen frequency components. If the signal is given by $x(n)$ and it is decomposed into
the chosen frequencies to give $X(f)$, the relation between the two is:

$$X = B \cdot x \quad \text{(analysis)}, \qquad x = A \cdot X \quad \text{(synthesis)} \qquad (4.1.3)$$

Here both the A and B matrices are of order 2Nx2N, while x and X are of order 2Nx1.
The decomposition is in the form of sine and cosine coefficients for each frequency. The
magnitude and phase of each frequency can be calculated using:

$$|F(n)| = \sqrt{X(2n-1)^2 + X(2n)^2} \qquad (4.1.4)$$

$$\text{phase}(F(n)) = \tan^{-1}\!\left(\frac{X(2n-1)}{X(2n)}\right) \qquad (4.1.5)$$
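As a concrete illustration of this computation, the following NumPy sketch builds the matrix A for an arbitrary set of analysis frequencies, inverts it to obtain B, and recovers magnitude and phase as in equations 4.1.4 and 4.1.5. This is a minimal sketch written for this text, not the original implementation; the example block length and frequency set are arbitrary choices.

import numpy as np

def nonorthogonal_transform(x, freqs, fs):
    # Decompose a real block x of length 2N onto the N chosen frequencies.
    # Columns of A alternate sine and cosine vectors (equation 4.1.2);
    # B = inv(A) is the analysis kernel (equation 4.1.3).
    N = len(freqs)
    assert len(x) == 2 * N, "block length must be twice the number of frequencies"
    w = 2 * np.pi * np.asarray(freqs) / fs    # digital frequencies between 0 and pi
    n = np.arange(2 * N)[:, None]             # time indices 0 .. 2N-1
    A = np.empty((2 * N, 2 * N))
    A[:, 0::2] = np.sin(n * w)                # odd columns: sine vectors
    A[:, 1::2] = np.cos(n * w)                # even columns: cosine vectors
    X = np.linalg.inv(A) @ x                  # analysis: X = B.x
    a, b = X[0::2], X[1::2]                   # sine and cosine coefficients
    mag = np.sqrt(a ** 2 + b ** 2)            # equation 4.1.4
    phase = np.arctan2(a, b)                  # equation 4.1.5
    return mag, phase

# Example: sample the DTFT of an 80-sample block of a 390 Hz tone at 40 points
fs = 32000.0
x = np.sin(2 * np.pi * 390.0 * np.arange(80) / fs)
freqs = 200.0 + 400.0 * np.arange(40)         # any 40 distinct frequencies below fs/2
mag, phase = nonorthogonal_transform(x, freqs, fs)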
In principle, this transform is similar to the DFT, but there are some basic differences.
First and most important is the fact that the frequencies into which the signal is to be
decomposed can be chosen whereas a DFT has fixed bins. Second, the DFT has
coefficients for both positive and negative frequencies whereas this transform works for
only positive frequencies. Third, the DFT can operate on complex signals, but this
transform works only on real valued signals. Finally, the DFT forces one of the bins to be
located at DC and another one to be located at the Nyquist frequency regardless of the
number of points used. These two frequencies need not be used in the new transform.
4.2 SUMMARY
In summary, the most useful property of this transform is that the frequencies used for
decomposition are selectable. The transform itself is not adaptive in any way, but this
property can be used to incorporate an adaptive algorithm discussed in the next chapter. It
is useful to remember that this transform does not give better frequency or time resolution
than the DFT. The only practical difference between the two is that the DFT samples the
DTFT at a fixed set of points whereas the new transform is capable of sampling the
DTFT at any desired frequency. The DTFT itself does not change in any way and
contains all the artifacts that come with sampling and windowing. Each frequency that is
found in a time frame still appears in the spectrum as a sinc function and indeed there is
interaction between the sinc functions of adjacent frequencies. In fact if two components
that are very close in frequency are present in the signal, the interaction between their
respective sinc functions makes it difficult to see two distinct frequencies in the total
spectrum. This is a windowing issue and needs to be dealt with separately.
In the light of this new freedom of selecting the frequencies for decomposition,
frequencies between DC and the Nyquist frequency can either be selected at random or
by using some criterion. In the next chapter the issue of selecting frequencies that are
useful during adaptation is discussed.
Chapter 5 - Adaptive Time/Frequency Analysis
The frequency resolution problem encountered during short-term analysis was discussed
in the introduction of chapter 4. This chapter starts with a discussion on why adaptation is
necessary followed by a review of earlier work using similar algorithms. The issue of
how adaptation is used to get better frequency resolution is then elaborated, and the
chapter ends with an explanation of how the time-frequency matrix is set up.
5.1 WHY ADAPTATION?
The aim of the procedure is to be able to capture the frequency information in each frame
and store it in a time-frequency matrix. This automatically results in data reduction
because there are only a limited number of frequencies actually present in an audio signal
during a given time frame. As an example, consider the signal to be a pure tone of 400Hz.
At sampling frequency 32kHz, a time frame of 40ms contains 1280 samples of data. This
frame of 1280 samples is shown in figure 5.1.1a. If an FFT is performed on the same
signal using the same number of points, the plot shown in figure 5.1.1b is obtained.
The dotted line in figure 5.1.1b shows the DTFT of the 400 Hz signal. The solid line is
the FFT, which is the sampled version of the DTFT. It is seen that index number 17 of the
FFT has all the energy because there are frequency bins at every 25Hz, and the frequency
of 400Hz exactly lines up with frequency bin number 17. If a peak detector is used in the
frequency domain, it will detect a peak at bin number 17 which means that bin 17 has the
highest amount of energy among all the bins. Instead of storing the 1280 samples
required to represent the signal in the time domain, the only information which needs to
be stored is the fact that the frequency of the signal is 400Hz as was found in the FFT and
the magnitude of this 400 Hz signal as specified by the FFT. This is a large reduction in
data.
Figure 5.1.1a 40ms frame containing 1280 samples of a 400 Hz signal, Fs=32 kHz
The above case is the simplest. Music usually has multiple frequencies and the
frequencies generally do not line up with frequency bins thereby resulting in spectral
spreading. When a 390 Hz signal of the same length is analyzed, since the frequency of
390 Hz is between the frequency bins at 375 Hz and 400 Hz, spectral spreading is seen in
the FFT. This is shown in figure 5.1.2.
The problem here is that if fixed frequency bins are used, it is not possible to find the
actual frequency of the signal when it doesn't line up with a bin. In this case, a peak
detector will again find that the highest amount of energy in the FFT is at bin number 17,
which corresponds to 400Hz. The fact that the actual frequency is 390Hz cannot be
determined unless some special technique, such as the adaptation technique described
later, is used.
Figure 5.1.1b DTFT and FFT of a 40ms frame of a 400Hz signal
When analyzing real music with multiple frequencies, a peak detector is used to detect all
the peaks in the DFT in order to find the frequencies that are actually present in that
frame. This is followed by the adaptation algorithm, which locates the frequency of the
detected peak more accurately. But even before adaptation, there are some inherent
problems that must be noted. Consider the case of a signal consisting of two closely spaced
frequencies of 390 Hz and 420Hz. If a DTFT analysis is performed on a 20 ms frame of
this signal, the DTFTs of the individual frequencies add up to give the DTFT of the
combination. This is shown in figure 5.1.3.
It is seen from this figure that although the signal is made up of two distinct frequencies
at 390Hz and 420Hz, the DTFT of the combination has only one peak at 404Hz. Even if a
peak detector is used followed by the adaptation algorithm that can exactly locate this
peak in the DTFT, it gives an inaccurate result because it finds only one peak at 404Hz
instead of one at 390Hz and one at 420Hz. This is because the DTFTs of the individual
frequencies are in the form of a sinc function and if the frequencies are very close
together, the side lobes and the main lobes of the two frequencies add up to give a result
that is very different from what is expected.
Figure 5.1.2 DTFT and FFT of a 390Hz signal of length 40ms
The problem gets worse as the separation between the individual frequencies gets
smaller. It also gets worse as the length of the window in time gets smaller. This is
because the smaller the window in time, the wider the individual spectra become and the
more the interaction between frequencies. Therefore, the peaks in a complex audio signal
are moved due to frequency interaction and no longer correspond to the exact
frequencies present in the signal. This is purely a result of windowing the data and
analyzing it in small frames in time. It cannot be solved or improved using the adaptation
algorithm because the algorithm is only meant for finding the peaks in the DTFT of the
combined signal. If the DTFT of the combined signal is already distorted, the algorithm
simply finds the peaks that are slightly shifted in this distorted DTFT. The longer the
frames that are used, the more this problem can be reduced. But when longer frames are
used, time resolution and transient information is lost.
Figure 5.1.3 DTFTs of individual tones of 390Hz and 420Hz along with the DTFT of the
combination tone
In general, it is preferable to use windows that are as long as possible without losing too
much time resolution. Another approach to solving this problem is by using special
windows that have reduced side lobes at the cost of wider main lobes. Some special
windows were experimented with, but it was found that the problem became worse due to
the widening of the main lobe. As long as frequencies in the signal are not very closely
spaced this problem is minimal. But at lower frequencies, where frequency separation
between individual notes tends to be small, the problem can be drastic. By dividing the
signal into frequency bands using a filter bank, the lower frequency bands can be
processed using larger frames as shown later.
5.2 ADAPTATION
The adaptation algorithm used in this study is based on the algorithm developed by
Dologlou et al., 1996. The basic principle of the algorithm is very simple and is based on
a binary search method for zooming in on the frequency peak of the DTFT.
Consider the case of the 390Hz tone spanning over a frame of 40ms discussed earlier. Its
DTFT and 1280-point FFT are shown in figure 5.1.2. As can be seen from the figure,
since there is no frequency bin at 390Hz, the closest frequency bin to 390Hz which is at
400Hz contains the highest energy. If a peak detector algorithm is used, it finds the peak
at 400Hz. The objective is now to refine this estimate and get closer to the actual peak in
the DTFT, which is at 390Hz. First of all, it is important to note that this peak bin is
always found on the main lobe of the sinc function of the DTFT of the actual frequency.
It is now assumed that the DTFT can be sampled at any desired frequency (which is
valid since a transform to do this was developed in chapter 4). If the DTFT is sampled at
frequencies just above 400Hz (e.g. 400.001Hz) and just below it (e.g. 399.999Hz), it is
found that the frequency just below 400Hz has slightly more energy than the point above
(since the peak is at 390Hz). This is shown in figure 5.2.1a.
This leads to the conclusion that the peak of the DTFT is below 400Hz (which is true in
this case because the peak is at 390Hz) but above 387.5Hz (since the previous frequency
bin is at 375Hz). It is guaranteed that the peak frequency is at a frequency that is higher
than the midpoint of 375Hz and 400Hz (387.5Hz). This is because if the peak frequency
were less than 387.5Hz, the peak frequency bin would have been the one at 375Hz. With this
in mind, the spectrum is sampled at the midpoint between 400Hz and 387.5Hz
(393.75Hz). Here, the direction the slope is headed in is again checked, by sampling the
DTFT at frequencies just above and just below 393.75Hz. This is shown in figure 5.2.1b.
It is found again that higher energy is present in the frequency just below 393.75Hz.
Again, it is concluded that the peak frequency is below 393.75Hz but above 387.5Hz.
The spectrum is again sampled at a frequency half way between these two points and the
procedure is continued for as many iterations as required. At each step, the frequencies
just above and just below the frequency found in that iteration are sampled to find out
whether the peak is at a higher or lower frequency than the present frequency. Using this
method, it is possible to get as accurate an estimate of the peak frequency as required.
Figure 5.2.1a Sampling the spectrum at frequencies just above and just below the 400 Hz
peak frequency bin
Figure 5.2.1b Sampling the spectrum at frequencies just above and just below 393.75Hz
The algorithm just described can be implemented only if the DTFT can be sampled at the
required frequencies. This cannot be done using conventional FFTs or DFTs, which have
fixed bins. But if the transform described in chapter 4 is used, frequency bins can be
chosen arbitrarily. This means that the freedom of sampling the DTFT at different
frequencies, which was assumed above, can be implemented with this new transform.
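A minimal sketch of this binary search follows. It assumes a helper dtft_mag(x, f, fs) that returns the DTFT magnitude of the block x at an arbitrary frequency f, which is exactly what the new transform provides (one concrete form of such a helper is shown in section 5.3); the function itself is illustrative, not the thesis code.

def refine_peak(dtft_mag, x, fs, peak_bin_freq, bin_spacing, iterations, eps=0.001):
    # Zoom in on the true DTFT peak, which is assumed to lie within half a
    # bin of the detected peak bin frequency.
    f = peak_bin_freq
    step = bin_spacing / 2.0
    for _ in range(iterations):
        step /= 2.0
        # probe just above and just below the current estimate
        if dtft_mag(x, f + eps, fs) > dtft_mag(x, f - eps, fs):
            f += step    # the slope rises to the right: the peak is above f
        else:
            f -= step    # the slope falls to the right: the peak is below f
    return f

Starting from the 400Hz bin with 25Hz spacing, the estimate moves through 393.75Hz, 390.625Hz and so on, converging on the true 390Hz peak exactly as described above.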
Selection of Frequencies
The basic procedure for the adaptation algorithm has been described in the previous
section. With this in mind, the initial set of frequencies used for decomposition can be
selected to give the best result for a given number of frequencies. Assume that N
frequencies between 0 and $\pi$ are to be selected to decompose a signal with 2N points. An
FFT would have N equally spaced frequencies between 0 and $\pi$, including 0 and $\pi$, which
correspond to DC and the Nyquist frequency. If adaptation is being used though, the
algorithm can adapt itself to any frequency within half the frequency interval between
adjacent bins, on either side of the peak frequency bin. In the case described in the
previous section, the frequency spacing between
bins is 25Hz. If the peak frequency bin is found to be at 400Hz, the algorithm can adapt
itself to any frequency within 12.5Hz on either side of 400Hz. This holds true for all
frequency bins. This also means that frequency bins at DC or at the Nyquist frequency
are not required. Instead, it suffices to have frequency bins located at 12.5Hz above DC
and at 12.5Hz below the Nyquist frequency and spaced at 25Hz everywhere else. A
general spacing of frequencies is shown in figure 5.2.2. Here d2 is the spacing between
adjacent frequency bins and d1 is the spacing between DC and the first bin as well as the
spacing between the Nyquist frequency and the last bin.
Figure 5.2.2 Example of frequency spacing for the non-adaptive transform
The case where $d_1 = d_2/2$ is used, since even if there is a frequency component at DC, it
is within the range of the first bin in terms of adaptation. The same applies with regard to
the Nyquist frequency and the last bin.
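Such a grid is trivial to generate; the following small helper (the name and form are my own) produces N frequencies with this spacing for a band extending from DC to fs/2:

import numpy as np

def analysis_frequencies(n_freqs, fs):
    # N bins between DC and Nyquist with edge spacing d1 = d2/2 (figure 5.2.2)
    d2 = (fs / 2.0) / n_freqs    # spacing between adjacent bins
    d1 = d2 / 2.0                # spacing from DC to the first bin
    return d1 + d2 * np.arange(n_freqs)

# For a 1 kHz sampling rate and 20 frequencies: 12.5, 37.5, ..., 487.5 Hz,
# i.e. 25 Hz apart with the end bins 12.5 Hz from DC and from Nyquist.
print(analysis_frequencies(20, 1000.0))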
Adaptive Algorithm by Dologlou et al., 1996
The algorithm developed by Dologlou et al. is shown in figure 5.2.3. This algorithm is
based on minimizing the energy error due to spectral spreading and was implemented
only for the case of a simple pure tone signal. The algorithm starts with some
initializations for variables. The initial set of frequencies to be used for the transform is
then selected. After computing the transform using these frequencies, a peak detection
algorithm is used to detect the peak in the calculated spectrum.
Figure 5.2.3 Adaptive algorithm by Dologlou et al.
The energy error, which is defined as the sum of the energy in all the bins except the peak
bin is calculated. The peak frequency is replaced by a frequency that is just below it by an
amount MaxDifference. The transform is then recalculated using this new frequency. The
new energy error is calculated using the same criterion as before after applying the new
transform. If it is less than the previous energy error, the peak frequency is replaced by
the mean of the peak frequency bin and the frequency bin just above it. Otherwise, it is
replaced by the mean of the peak frequency bin and the frequency bin just below it. It
must be noted that the frequencies of the peak bin and that of the bins below and above it
are being continuously updated in this process. The process terminates when the peak bin
calculated in the present iteration and the one calculated in the previous iteration differ by
an amount less than MaxDifference. It should be noted that this method works well only for
signals which have a single frequency component. The energy error criterion becomes
difficult to use for signals that have multiple frequencies.
Adaptive Algorithm by Matt Kotvis, 1997
Matt Kotvis improved upon the above-mentioned algorithm to produce a more efficient
method better suited to typical audio signals. He used filter banks to decompose the
signal into 10 octaves starting at 22.05kHz and working downwards. Each octave was
analyzed using a different window length depending on the number of frequencies that
were used in the transform for that octave. The number of frequencies used for
decomposition determined the size of the decomposition matrix and therefore the window
length. The table showing the window lengths in milliseconds for 4, 8 and 16 frequency
transforms is shown in figure 5.2.4.
He found that the arrangement that gave the best trade off was using 16 frequencies in the
highest seven octaves, 8 frequencies in the third octave and 4 frequencies in the lowest
two octaves. Since the consecutive octaves were down sampled (described in chapter
2.4), the same transformation matrix containing 16 frequencies between 0 and $\pi$ could be
used for the top seven octaves followed by an 8 frequency matrix for the third octave and
the same 4 frequency matrix for the first two octaves. Since the same matrix was used for
different octaves, a conversion was required to calculate the actual frequency in that
octave and make up for the down sampling operation.
Octave 4 Frequencies 8 Frequencies 16 Frequencies
10 (10 - 20 kHz) 0.363ms 0.726ms 1.45ms
9 (5 - 10 kHz) 0.726ms 1.45ms 2.9ms
8 (2.5 - 5 kHz) 1.45ms 2.9ms 5.8ms
7 (1.25 - 2.5 kHz) 2.9ms 5.8ms 11.6ms
6 (625 - 1250 Hz) 5.8ms 11.6ms 23.2ms
5 (312 - 625 Hz) 11.6ms 23.2ms 46.4ms
4 (156 - 312 Hz) 23.2ms 46.4ms 92.9ms
3 (78 - 156 Hz) 46.4ms 92.9ms 186ms
2 (39 - 78 Hz) 92.9ms 186ms 372ms
1 (0 - 39 Hz) 186ms 372ms 743ms
Figure 5.2.4 Table showing octaves and analysis window lengths depending on number
of frequencies used in the transform
The flow chart for his algorithm is shown in figure 5.2.5. This algorithm works basically
the same way as the one by Dologlou et al., but with some additional improvements. First
the total energy in each time block is calculated. All blocks with energy below a certain
threshold are ignored during the adaptation process. This is to guard against very quiet
sections which have a very low signal to noise ratio thereby resulting in the detection of
totally irrelevant frequencies during that period. Second, several frequency peaks could
be adapted to, in a single block provided that it is established that there is more than one
frequency peak in that time block of that octave. Also no frequency which is less than 10
dB of the maximum energy in a block is adapted to. Additionally, no more than one
quarter of the frequencies in a time block could be adapted to.
Figure 5.2.5 Flowchart for Matt Kotvis' algorithm
(Taken from Kotvis, 1997)
The only purpose of having these constraints is to reduce the computation time by getting
rid of time/frequency points without significant energy. Another improvement in the
algorithm is a condition that forces any time/frequency point to stay within certain
boundaries during adaptation. In the previous algorithm, for certain signals, the
adaptation algorithm would not converge on nearby frequencies but would instead
continuously move to a higher or lower frequency until it was virtually on top of
another frequency. This problem is solved by forcing the adaptation to move inwards
after the first iteration. This is necessary because the first iteration samples the DTFT at
half the frequency interval between the lower and higher bin. If the algorithm were
allowed to move continuously higher or lower, it would converge on the adjacent bin. So,
if the first iteration is towards the left (lower in frequency), the second iteration is
towards the right. The other criterion that is changed in this improved algorithm is the
criterion for ceasing adaptation. Whereas in the previous case, the criterion for ceasing
adaptation was when the difference between the adapted frequency of the present and the
previous iteration was below a certain threshold, in this case it ceases adaptation after
exactly 10 iterations in all cases.
This algorithm has certain drawbacks. Unlike the non-adaptive case, where one
frequency set is used, the adaptive distribution could have a large number of frequencies
used for decomposition. This could potentially create a different frequency set used for
each and every time-frequency block and this leads to large data size. Another
disadvantage with this algorithm is that it does not know when a signal has been exactly
matched. For example, even if a signal lies exactly on a frequency bin, it tries to adapt to
it by shifting the frequency of the bin away from it and then continuously towards it,
thereby minimizing error after 10 iterations. Also, the time-frequency blocks that are used
are fixed. There are only 3 sets of transforms using 4, 8 or 16 frequencies, which are used
during decomposition and block sizes for each band are fixed. Considering that different
types of audio data have different properties in terms of frequency content and dynamics,
this may not be the best approach. A better approach would be to have flexible block
lengths, a flexible number of frequencies used for decomposing and a flexible number of
iterations for adapting in each block depending on the type of music or audio signal being
processed. Also, the fact that there are only 3 sets of transforms used implies that there
are only certain window lengths in time that can be used. In the case of this algorithm, the
lowest sub-band (discussed later) is processed at 186 ms, which is too large a window
length for music and leads to very poor time resolution. The highest sub-band is
processed at window lengths of 1.45 ms, which is too small and leads to increased
spectral spreading.
5.3 IMPROVED ADAPTIVE ALGORITHM
The algorithms mentioned above have certain weaknesses that are corrected in the
improved algorithm developed in this study. The improvements are as follows:
Improvements
1) The algorithm by Matt Kotvis uses filter banks to divide the signal into frequency
bands but does not make full use of them. Based on the properties of the signal, it is
useful to be able to process different bands of data using different parameters. The
algorithm developed in this study allows for any required window length to be used
for any sub-band. Thus the optimum window length for a given sub-band can be used
while processing it.
2) The algorithm also allows the user to choose the compression level by adjusting the
maximum number of detected peaks. A ceiling can be put on this, so that the data rate
of the new representation never exceeds the required amount.
3) Also, a psychoacoustic model is used to discard redundant information in each
frequency band thereby reducing data. This validates the need for frequency bands so
that the model can be applied to each band separately.
4) Variable frequency resolution is used to reduce the computational power required.
This method uses the JND in frequency of the human ear discussed in chapter 3.
5) The DFT matrix is the inverse of the IDFT matrix. However it can be obtained from
the IDFT matrix without using matrix inversion. This observation is extended to the
algorithm being developed and the process of matrix inversion, which is required in
the algorithm by Dologlou et al. and Kotvis, is skipped. This results in increased
computational speed.
Since the filter bank is critical to all these improvements, this section begins with a
discussion on the filter bank used to separate the signal into bands. This is also the first
step of the algorithm.
The Analysis Filter Bank
When using filter banks, it is common to have an analysis as well as a reconstruction
section. Filter banks are discussed in chapter 2.4 and figure 2.4.3 shows a typical analysis filter
bank which divides the signal into 10 octaves. In this study the signal is divided into six
bands and this is shown in figure 5.3.1. The reasons for dividing the signal into six bands
are discussed later. At each stage, there is a high pass filter (H1) and a low pass filter
(H0), both followed by decimation stages. As a result of decimating the signal in the
manner shown in figure 5.3.1, the same high pass and low pass filters can be used for
each stage though the cutoff frequencies in Hz at each stage are different. This is because
a typical digital filter is just a combination of coefficients that are adjusted such that there
is a cutoff frequency at some frequency between 0 (DC) and $\pi$ (the Nyquist frequency). The
cutoff frequency in Hz depends on the sampling rate of the input signal. As an example,
if a filter has a cutoff frequency at $\pi/2$, it means that for an input signal with a sampling
rate of 32kHz, the cutoff frequency is at 8kHz. But if the input signal has a sampling rate
of 8kHz, the cutoff frequency shifts to 2kHz. It turns out that because of this reason and
the fact that the signal is being decimated, the same high pass filter coefficients and low
pass filter coefficients can be used at each stage.
At each stage the signal is divided into a high-passed version that has frequencies from
$\pi/2$ to $\pi$ and a low-passed version that has frequencies between 0 and $\pi/2$. Due to
aliasing concerns during down sampling, it must be ensured that the attenuation for both
these filters is sharp and there is minimum energy leakage. The low-pass filter has to be
designed such that the attenuation at frequencies of $\pi/2$ and above is very high. The
same holds good for the high-pass filter with regard to frequencies of $\pi/2$ and below. In
the case of a perfect reconstruction filter bank, some aliasing is allowed during analysis
because the synthesis filter bank is designed to cancel it out. Since only the analysis
section is used, such perfect reconstruction filters are not required and higher order filters
with narrow transition bands can be used.
Figure 5.3.1 Filter bank dividing the signal into six bands
In this study, an eighth order elliptic IIR (Infinite Impulse Response) filter is used for
anti-alias filtering. Also, the zero-phase digital filtering technique is used to avoid the
phase distortion that is inherent in high order IIR filters. Zero-phase digital filtering
(Jeong & Ih, 1999) is conducted by processing the input data in both the forward and the
reverse direction: after filtering the data in the forward direction, the filtered sequence is
reversed and run through the filter again. The resulting sequence has zero phase
distortion and double the filter order. The elliptic filter is chosen for its narrow
transition-band and high stop-band attenuation. The low-pass filter has a cutoff frequency
$f_c = 0.468\pi$ so that at the frequency of $0.5\pi$, the attenuation is at least 100dB. The
attenuation of 100dB is chosen because 16-bit quantized signals have a dynamic range of
around 96dB. Similarly, an eighth order elliptic high-pass filter with a cutoff frequency at
$0.532\pi$ is used to perform the high-pass filtering. The magnitude responses of both
these filters are shown in figures 5.3.2a and b. The drawback of this method is that any
frequencies at or in the close vicinity of $0.5\pi$ are sharply attenuated. Since the
zero-phase filtering technique is used, the phase response of both filters is ultimately zero
and is not shown. The coefficients are given in appendix A.
Figure 5.3.2 (a) Low-pass filter magnitude response and (b) High-pass filter magnitude
response
The sampling rate of the input signal is chosen to be 32 kHz. The filter bank divides the
signal into six sub-bands between DC and 16 kHz. The six sub-bands are as follows:
1) y6: 8 - 16 kHz
2) y5: 4 - 8 kHz
3) y4: 2 - 4 kHz
4) y3: 1 - 2 kHz
5) y2: 500 - 1000 Hz
6) y1: 0 - 500 Hz
The signals y1 - y6 are the filtered versions of the input signal and contain the frequency
information shown above. It is important to note that due to the decimation process these
signals do not actually sound like filtered versions of the signal because their spectra have
been mirrored. As an example, y6 contains information between 8 and 16 kHz of the
original signal, but in y6 it is present between 0 and 8 kHz. After analysis of this signal, a
group of peaks between 0 and 8 kHz is detected. Using a conversion formula, these are
converted to their actual values between 8 and 16 kHz in the original signal. The same
holds true for all the sub-bands.
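A sketch of this analysis filter bank using SciPy is shown below. The eighth order elliptic designs, the 0.468π and 0.532π cutoffs and the 100dB stop-band attenuation follow the text above; the 0.1dB pass-band ripple is an assumed value, since it is not specified here. filtfilt performs the forward-backward (zero-phase) filtering described earlier.

import numpy as np
from scipy import signal

# Eighth order elliptic filters; 100 dB stop-band attenuation from the text,
# 0.1 dB pass-band ripple is an assumption.
b_lo, a_lo = signal.ellip(8, 0.1, 100, 0.468)                  # low-pass, cutoff 0.468*pi
b_hi, a_hi = signal.ellip(8, 0.1, 100, 0.532, btype='high')    # high-pass, cutoff 0.532*pi

def analysis_filter_bank(x):
    # Split x (sampled at 32 kHz) into the six sub-bands y1 .. y6 by
    # repeatedly splitting into high/low halves and decimating by 2
    # (figure 5.3.1). The high-passed bands come out spectrally mirrored.
    bands = []
    lo = np.asarray(x, dtype=float)
    for _ in range(5):
        hi = signal.filtfilt(b_hi, a_hi, lo)[::2]    # top half-band, decimated
        lo = signal.filtfilt(b_lo, a_lo, lo)[::2]    # bottom half-band, decimated
        bands.append(hi)
    bands.append(lo)       # the final low-pass output is y1 (0 - 500 Hz)
    return bands[::-1]     # [y1, y2, y3, y4, y5, y6]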
Time-Frequency Blocks
The next step in the algorithm is to set up the time-frequency blocks for each sub-band.
As discussed earlier, the algorithm is flexible enough so that it can be adapted to the type
of audio signal being analyzed. Each sub-band signal is processed separately to give the
time-frequency distribution for that sub-band. The block (or frame or window) lengths
for all the sub-bands are in the region of 10 ms - 80 ms. In general, the lower frequency
sub-bands are processed using longer block lengths than the higher frequency sub-bands
to reduce the spectral spreading problem discussed earlier. At the beginning of the
algorithm, there is a set of variables (including block length) that are set to certain
optimum values depending on the type of signal being analyzed. As an example, table
5.3.1 shows the block lengths used for the six sub-bands when the expected signal is of
the single instrument type.
Sub-band Block length chosen Samples/block Frequencies used in decomposition Size of transform matrix
Y1 (0 - 500Hz) 80 ms 80 40 80x80
Y2 (500 - 1000Hz) 40 ms 40 20 40x40
Y3 (1 - 2kHz) 40 ms 80 40 80x80
Y4 (2 - 4kHz) 40 ms 160 80 160x160
Y5 (4 - 8kHz) 40 ms 320 160 320x320
Y6 (8 - 16kHz) 20 ms 320 160 320x320
Table 5.3.1
These block lengths are empirically derived by experimenting with various settings for
various instruments, and using the set that gives the best results. The size of the transform
matrix (the number of frequencies used for decomposition) depends on the block length
and the algorithm calculates the suitable transform matrix depending on the block length
chosen. The only constraint on the block length is that it must be a number of the form
$5 \cdot 2^n$ milliseconds, where n is any non-negative integer. In other words, 5ms,
10ms, 20ms, 40ms, 80ms etc. are all suitable. This constraint is present because the signal
must be divided into an integer number of blocks, and the algorithm is written so that if
these window lengths are used, the signal can be divided into an integer number of
blocks.
Maximum Frequency Peaks
Another set of variables which need to be set at the beginning of the algorithm are the
number of peaks that are adapted to in each block of each sub-band. The number of
frequency peaks that are detected depends purely on the music and the number of peaks
present in the DTFT. But the number of peaks that are actually retained after applying the
psychoacoustic model is less than that number. In some cases even this number is too
high. So, a ceiling is set on the number of peaks that are stored for every block of a given
sub-band. Again this is set up depending on the type of audio signal
being analyzed and the amount of compression required. Knowing the expected audio
signal gives an estimate of which band contains the majority of the information and thus
this variable is set up accordingly. As an example, the maximum number of frequency
peaks to be stored for a signal that is expected to be a guitar chord is shown below in
table 5.3.2. In this study these numbers are derived empirically.
Sub-band Maximum frequency peaks
Y1 (0 - 500Hz) 6
Y2 (500 - 1000Hz) 6
Y3 (1 - 2kHz) 8
Y4 (2 - 4kHz) 12
Y5 (4 - 8kHz) 12
Y6 (8 - 16kHz) 0
Table 5.3.2
Variable Frequency Resolution
The next set of variables that need to be set up are the number of iterations that need to be
performed during adaptation. This is directly related to the frequency resolution of the
human ear described in chapter 3.2. It is seen from figure 3.2.1 that the human ear only
has a limited frequency resolution, which is different at different frequencies. As an
example, according to figure 3.2.1, at 500Hz, the frequency resolution of the average
human ear is around 4Hz. In other words, the human ear cannot distinguish between
500Hz and 504Hz. This in turn implies that while the algorithm adapts to some peak
frequency, and zooms in on the exact peak frequency in the DTFT, there is only a certain
level of accuracy required before the human ear fails to distinguish the difference (Jeong
& Ih, 1999). This means that only a certain number of iterations are required
while adapting to the peak frequency before the human ear cannot distinguish the
difference. This in turn results in reduced calculations. For a given sub-band of
frequencies, the number of iterations required to get within a given range from the actual
peak frequency can be calculated. Consider the case of the signal being of the single
instrument type. From table 5.3.1, it can be seen that for the sub-band signal y2
consisting of frequencies between 500 and 1000Hz, 20 frequencies are used for
decomposition in each block. This means that the frequency separation between adjacent
frequency bins is roughly:
$$\Delta f = \frac{1000 - 500}{20} = \frac{500}{20} = 25 \text{ Hz}$$
The frequency separation is very close to but not exactly equal to 25Hz because the bins
are separated by equal distances everywhere except the first and last bins as discussed in
the section Selection of Frequencies in chapter 5.2. This means that any located peak is
already within 12.5 Hz of the actual peak in the DTFT. The next iteration results in a
frequency, which is within half that distance from the peak frequency, and every
subsequent iteration during adaptation converges upon the peak frequency in the DTFT.
From figure 3.2.1 it can be observed that the frequency resolution of the ear between
500Hz and 1000Hz is between 4Hz and 5Hz. Even if a conservative estimate of 1Hz is
assumed, to make up for listeners with very good resolution, it is found that only 4
iterations are needed to get within 0.78125Hz of the actual peak frequency.
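The required iteration count per band follows directly from this reasoning: the initial error is at most half the bin spacing and each iteration halves it. A small helper of my own making reproduces the arithmetic (the counts in table 5.3.3 below are the thesis's own and are slightly more conservative in places):

import math

def iterations_needed(bin_spacing_hz, resolution_hz):
    # Smallest i such that (bin_spacing / 2) / 2**i is below the resolution
    return math.ceil(math.log2((bin_spacing_hz / 2.0) / resolution_hz))

print(iterations_needed(25.0, 1.0))    # 4, matching the 500 - 1000 Hz case above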
Using a similar criterion for each band of frequencies, a specific number of iterations
required per sub-band can be set up. This is known as variable frequency resolution
because the frequency resolution depends on the specific sub-band being analyzed. This
saves computation time compared to the method used by Kotvis, where every frequency
was adapted to using 10 iterations. An example is shown below in table 5.3.3 for the case
of the signal being of single instrument type.
Sub-band | Frequency range | Frequencies in transform | Δf between bins | Frequency resolution of the ear (figure 3.2.1) | Conservative estimate of frequency resolution | Iterations required
Y1 | 0 - 500Hz | 40 | 12.5Hz | 3Hz | 0.5 Hz | 5
Y2 | 500 - 1000Hz | 20 | 25Hz | 4Hz | 1 Hz | 4
Y3 | 1 - 2kHz | 40 | 25Hz | 5Hz | 2 Hz | 3
Y4 | 2 - 4kHz | 80 | 25Hz | 10Hz | 4 Hz | 2
Y5 | 4 - 8kHz | 160 | 25Hz | 20Hz | 7 Hz | 2
Y6 | 8 - 16kHz | 160 | 50Hz | - | 10 Hz (by extrapolation) | 3
Table 5.3.3
The number of iterations required is thus calculated in the last column of table 5.3.3
depending upon the various factors shown, and set up at the beginning of the algorithm.
While adapting to a frequency peak, this reduces computations by a factor of between 2
and 4.
Setting up the Transform Matrix
The transform matrix, its creation and how it was used in the previous algorithms was
described in chapter 4. In brief, it involved choosing a set of frequencies, creating the
matrix containing the basis functions and then inverting it to get the kernel. This
procedure is derived from general matrix theory and is the correct way to create the
transform matrix. However there is an easier way to obtain the kernel. Consider the DFT
matrix and its inverse the IDFT matrix. The IDFT matrix contains the basis functions and
when it is inverted the DFT matrix is obtained, which contains the kernel. They are
related as:

$$IDFT = DFT^{-1} \qquad (5.3.1)$$
If x(n) is an N point sequence of data, then the DFT and the IDFT operations respectively
are as follows:

$$DFT = \sum_{n=0}^{N-1} x(n)\, W_N^{mn}, \qquad m = 0, 1, \ldots, N-1 \qquad (5.3.2)$$

$$IDFT = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, W_N^{-mn}, \qquad m = 0, 1, \ldots, N-1$$

where $W_N = e^{-j 2\pi / N}$.
These two operations are the inverse of each other, and yet the only difference between
them is the sign in the exponent of $W_N$, together with an overall gain constant of 1/N.
In fact, if the DFT and IDFT operations are performed on the same real-valued sequence
of data, the only difference between the magnitudes of the two outputs is the scaling
factor 1/N. This means that for sampling the DTFT of a given sequence, the IDFT
transformation can be used with a scaling factor of 1/N to get the same result as
the DFT operation. This is an important result and is used extensively during this study.
Since only the magnitude characteristics of the DTFT are important, the matrix inversion
demonstrated in chapter 4 can be skipped while forming the transform. Instead, the
transformation matrix is formed by multiplying the basis function matrix of size NxN by
the scaling factor 1/N. Thus the procedure of matrix inversion is avoided each time the
transformation matrix is formulated, which greatly reduces computational complexity.
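This relationship is easy to verify numerically; the following snippet (purely illustrative) checks that for a real sequence the DFT magnitudes and the IDFT magnitudes agree up to the factor 1/N:

import numpy as np

x = np.random.randn(64)                        # any real-valued sequence
dft_mag = np.abs(np.fft.fft(x))
idft_mag = np.abs(np.fft.ifft(x)) * len(x)     # np.fft.ifft already includes the 1/N factor
print(np.allclose(dft_mag, idft_mag))          # True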
Also, in the algorithm by Dologlou et al. as well as the improved one by Matt Kotvis, the
criterion of least energy error is used to find the peak energy. For each step, the entire
transform matrix is recalculated by replacing just one basis function and then inverting
the matrix. Further, at each step energy error is calculated by applying this entire new
transform matrix and then using the formula for calculating energy error which is the sum
of energy in all bins except the present frequency bin. These calculations require a large
amount of processing power especially since they are done at every step and iteration
during adaptation.
The same results are obtained by noting that only the peaks in the DTFT need to be
found. This means that there is no necessity of calculating the entire transform at each
step and no need for calculating energy error during adaptation. The energy in the peak
frequency bin, which was found when the transform was used, is first stored. Then by
shifting the frequency in the direction where the energy increases, the peak frequency in
the DTFT can be converged upon. At each step it is only necessary to calculate the basis
function corresponding to the desired frequency (1 row of the transformation matrix).
The energy in that one frequency bin is calculated and then compared to the previous
case. In fact this procedure is very similar to the one outlined in the first section,
Adaptation in chapter 5.2.
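In code, evaluating the energy at one candidate frequency reduces to projecting the block onto a single sine/cosine pair, which is one possible form of the dtft_mag helper assumed in section 5.2. The sketch below is my own rendering of this idea, with the 1/N scaling of the previous section:

import numpy as np

def dtft_energy(x, f, fs):
    # Energy of block x at an arbitrary frequency f, found by projecting x
    # onto one sine/cosine pair, i.e. only the relevant rows of the
    # transform; the full matrix is never rebuilt or inverted.
    n = np.arange(len(x))
    w = 2.0 * np.pi * f / fs
    a = np.dot(x, np.sin(w * n)) / len(x)    # sine coefficient, scaled by 1/N
    b = np.dot(x, np.cos(w * n)) / len(x)    # cosine coefficient, scaled by 1/N
    return a * a + b * b

Since only comparisons between candidate frequencies matter during adaptation, the energy can be used directly in place of the magnitude (its square root).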
The following, then, is a summary of the initialization procedure, including user input for
the algorithm.
1) Enter the type of signal being processed (musical piece, single instrument, voice?)
2) Depending on the type of signal, certain parameters that were found to be optimum
are already set up for each type of signal, but can be changed if required. The
following are the parameters:
a) Block length for each sub-band: blocklength_1 - blocklength_6
b) Number of iterations during adaptation for each sub-band: iteration_1 - iteration_6
c) Maximum number of frequencies to adapt to for each sub-band: peaks_1 - peaks_6
d) Type of window to be used (if other than a rectangular window).
3) Check if the total number of samples in the input signal is a multiple of 1280. This is
required for decimation to be done properly. If it is not a multiple, pad zeros equally
on the left as well as right, so that it is a multiple.
4) Divide the signal into 6 sub-bands using the filter bank and store these sequences (y1
y6)
5) Consider the block length for each sub band and call a function that creates the proper
transform matrix based on the block size required for each band. The 6 transform
matrices are stored.
6) Process each block of each sub band using the proper transform matrix and then apply
the adaptation procedure specified in the next section.
Adaptation Procedure
This section describes the adaptation procedure for a single block of a given sub-band.
After setting up all the variables as well as the transform matrix as outlined in the
previous section, adaptation can begin. Figure 5.3.3a outlines the initialization procedure
for a single block of data and figure 5.3.3b demonstrates the adaptation procedure for a
single frequency peak.
Figure 5.3.3a Initialization procedure for the new adaptation algorithm:

(Already set up: block length = bl, iterations = i, maximum peaks = p)
1) Feed in the block y(n) and apply a window to it if required.
2) Apply the transformation matrix (already calculated) to the block to obtain the transformed sequence (spectrum) Y(m).
3) Detect the peaks in Y(m). A peak is defined as a point with greater magnitude than both of its adjacent points.
4) If the total number of detected peaks exceeds p, sort the peaks in order of magnitude and keep only the p highest magnitude peaks.
5) Apply the psychoacoustic model to discard redundant peaks.
6) Perform adaptation for each of the remaining peaks.

Figure 5.3.3b Actual adaptation of a single peak:

Initialize: diff = 0.001Hz; fd = (frequency bin spacing)/2; newfreq = freq; newenergy = energy; oldenergy = energy; mult = 1.
Do i times:
  If mult = 0, return newfreq and newenergy and exit.
  plusfreq = newfreq + diff; plusenergy = energy in the DTFT at frequency plusfreq.
  minusfreq = newfreq - diff; minusenergy = energy in the DTFT at frequency minusfreq.
  K = plusenergy - minusenergy.
  If K is positive: fd = fd/2; newfreq = newfreq + fd; newenergy = energy in the DTFT for newfreq.
  If K is negative: fd = fd/2; newfreq = newfreq - fd; newenergy = energy in the DTFT for newfreq.
  If K = 0: both energies are the same, so mult = 0 (newfreq is exactly on the peak).
Return newfreq and newenergy.
In figure 5.3.3a, the initialization procedure for the adaptation algorithm is shown. In the
previous section, the calculation of the parameters for block sizes, maximum numbers of
frequencies and number of iterations for each block of each sub-band are shown. These
are the starting inputs for the initialization section. The signal (block) is fed into the first
section. If required, the block of data is windowed using a special window such as a
Hamming, Hann or Blackman window. The transformation matrix (which has already
been calculated) is then applied on this (windowed) block of data to obtain the
transformed sequence.
In the next step, a peak detector is applied on the magnitude characteristics of this
transformed sequence. The peak detector detects and stores all the frequency peaks and
their energies as found in the transformed sequence. These are the frequencies that are
actually present in the block of data but are still inaccurate. Adaptation is performed for
each peak to find a more accurate estimate of the actual peak in the DTFT. If the number
of peaks exceeds the maximum number of peaks set up at the beginning, only the peaks
that are highest in magnitude are kept to remain within this constraint. A psychoacoustic
model is then applied to further discard redundant frequencies. In general, any
psychoacoustic model may be applied to discard peaks in an intelligent manner. At
present, only one psychoacoustic criterion is used. From figure 3.3.2, it is seen that
a 50dB tone of frequency 415Hz at the center of an octave successfully masks
frequencies in that octave whose magnitudes are more than 20dB below it. Each sub-band
consists of an octave of information, so all frequencies in a block of a sub-band that
are more than 20dB in magnitude below the highest peak in that block are
discarded. Thus the number of peaks is further reduced. The next step is to adapt to each
one of these peaks.
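As a sketch, this criterion amounts to a one-line filter per block. Peaks are assumed here to be (frequency, magnitude) pairs with magnitudes on a linear amplitude scale, where 20dB corresponds to a factor of 10; the function is illustrative only.

def discard_masked(peaks):
    # Keep only peaks within 20 dB of the strongest peak in the block.
    if not peaks:
        return peaks
    strongest = max(mag for _, mag in peaks)
    return [(f, mag) for f, mag in peaks if mag > strongest / 10.0]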
The adaptation process for a single peak is shown in figure 5.3.3b. To better understand
this procedure, the following is an outline of what each variable stands for:
Variable Explanation
energy Energy in the frequency peak before adaptation
freq Frequency to be adapted (frequency of the peak bin)
newfreq Frequency variable that is moved continuously in a binary search towards the actual peak
fd Incremental amount used to change newfreq at each iteration; it is reduced by half after each iteration
oldenergy Energy in the DTFT for the frequency corresponding to the previous stage of adaptation
newenergy Energy in the DTFT for the frequency corresponding to the present stage of adaptation
diff A minute incremental value used to increase and decrease newfreq at each iteration to find out in which direction the peak lies
At the end of the adaptation, newfreq is returned as the adapted peak frequency and
newenergy is returned as the adapted peak energy. The algorithm starts with more
initialization. The next step is the iteration loop which is set to perform i times as set up
earlier. This is followed by a check for a variable named mult. As long as mult is not
zero, the iteration continues till the end, but if mult is zero, it exits and returns the
frequency and energy found during the present iteration of adaptation. The reason for this
is that mult is set to zero later if it is found that newfreq is exactly on the DTFT peak. The
next step is to increment the newfreq by a small amount and find the energy in the DTFT
at that frequency. This is done not by recalculating the whole transform (including the
inverting procedure) and applying it on the signal, but by only changing the relevant rows
in the present transform according to this new frequency. Then, instead of applying the
whole transform on the signal as was done in previous papers, only the relevant rows are
applied and the new energy is calculated. The same is done for a similar decrement in
frequency. The two energies thus obtained are compared. If the first energy is greater
than the second, then newfreq is incremented by fd/2. If the second energy is greater than
the first, then newfreq is decremented by fd/2. Thus by reducing fd by a factor of two
during each iteration and continuously moving towards the direction of greater energy, it
is possible to be within the required range of the actual peak in the DTFT. This is how
adaptation is carried out for a single frequency. This procedure is repeated for every peak
in every block of every sub-band and these peaks and their energies are stored. The next
step is to arrange them in a time-frequency matrix.
5.4 - TIME-FREQUENCY REPRESENTATION
As described above, all the frequency peaks for each block of each sub-band and their
energies are stored. The reader will recall that the spectral characteristics of each band
were mirrored around the new Nyquist frequency after decimation. This means that a
conversion is required to convert the detected frequencies into the actual frequencies
present in the original signal sampled at 32kHz. The formula
for conversion is as follows:

$$f_{act} = \frac{F_s}{2} \cdot \frac{1}{2^{\,6 - octave}} - f_{dig} \qquad (5.4.1)$$

Here $f_{dig}$ is the digital frequency that was located and adapted to in the decimated
signal, octave is the index of the sub-band, and $f_{act}$ is the actual value of this
frequency as present in the original signal. Using this
equation every adapted peak frequency is converted back to its actual frequency. Now the
adapted frequency peaks of each sub-band as well as their magnitudes are arranged in
two separate matrices. They are both arranged so that every column corresponds to a
block or frame in time. If there are N blocks that were analyzed, then there are N columns
in the matrix. In the first matrix, each column contains the actual adapted frequency
peaks that were found in that block. In the second matrix, each column contains the
corresponding magnitudes of the adapted frequencies. Together, these two matrices form
a time-frequency representation for the music. The music is later re-synthesized using the
data in these time-frequency matrices. This is covered in the next chapter.
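A sketch of this final bookkeeping step is given below. It relies on the reconstruction of equation 5.4.1 shown above (and shares its uncertainty); the direct mapping for the lowest band, which is a low-pass rather than a mirrored high-pass output, and the zero-padded matrix layout are both my own assumptions.

import numpy as np

FS = 32000.0

def to_actual_frequency(f_dig, band):
    # Undo the spectral mirroring of equation 5.4.1 for the sub-bands y2 .. y6
    if band == 1:
        return f_dig    # y1 is a low-pass output, not mirrored (assumption)
    return FS / 2.0 / 2.0 ** (6 - band) - f_dig

def build_matrices(blocks, band, max_peaks):
    # One column per analyzed block; rows hold the adapted peak
    # frequencies and their magnitudes, zero-padded to max_peaks rows.
    freq_mat = np.zeros((max_peaks, len(blocks)))
    mag_mat = np.zeros((max_peaks, len(blocks)))
    for j, peaks in enumerate(blocks):    # peaks: list of (freq, mag) pairs
        for i, (f, m) in enumerate(peaks[:max_peaks]):
            freq_mat[i, j] = to_actual_frequency(f, band)
            mag_mat[i, j] = m
    return freq_mat, mag_mat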
Chapter 6 - Synthesis Using Spectral Interpolation
Having created the time-frequency matrix as described in the previous chapter, the next
step is to test the quality of this alternative method of representation. It might achieve
good compression, but the question is how good it sounds when it is re-synthesized.
There are a variety of ways to re-synthesize the music using the time-frequency matrix.
However, not all of them sound equally good. The choice of the re-synthesis method is
therefore not straightforward, and a method must be found that is best matched to the
unique sinusoidal model used for analysis. A method that uses spectral interpolation
between frames is chosen because it is found to give the best results.
6.1 WHY SPECTRAL INTERPOLATION?
There are a variety of methods that can be used for re-synthesizing the music based on
the time-frequency matrix and some of these are outlined in chapter 2.5. On a basic level
the model of additive synthesis shown in figure 2.5.3 is used. Each sub-band has its own
time-frequency matrix with possibly differing block lengths for different sub-bands. Each
time-frequency matrix has consecutive columns representing consecutive frames. Each
frame contains certain frequencies in the frequency matrix and their corresponding
magnitudes are in the magnitude matrix. Since the frequency and magnitude parameters
change frame by frame, the additive synthesis model shown in figure 2.5.3 could be used
by simply changing the parameters at each frame. But the problem with such a simple
method is that it doesnt take into account the ending phase of the synthesized signal
from one frame and the starting phase of the synthesized signal in the next frame. When
these two frames are synthesized separately and simply pasted together, the result is
almost certainly a discontinuity at every frame boundary. This is heard as a pop at every
frame and when played back at the correct rate is heard as a continuous scratching sound.
As an example, consider a signal with 32kHz sampling rate consisting of three
frequencies. The three frequencies are 1000Hz, 1250Hz and 1500Hz and they are present
in equal proportion. Assume that this signal is analyzed in blocks of 25 ms, which
correspond to 800 samples per block. Focus on the first two blocks consisting of a total of
1600 samples. This is shown in figure 6.1.1.
Figure 6.1.1 Original signal containing three frequencies segmented into two blocks
When these blocks are analyzed, to form a time frequency matrix containing two columns
(for two blocks), it is found that both the blocks have the three frequencies at equal
magnitudes. Assume for the sake of simplicity that the analysis was perfect and that the
frequencies found were exactly 1000Hz, 1250Hz and 1500Hz at a magnitude of 1 each.
This can be represented in two time-frequency matrices as follows:
Block 1 Block 2
Frequency 1 1000Hz 1000Hz
Frequency 2 1250Hz 1250Hz
Frequency 3 1500Hz 1500Hz
Table 6.1.1a Frequency matrix for the first two blocks
Block 1 Block 2
Magnitude 1 1 1
Magnitude 2 1 1
Magnitude 3 1 1
Table 6.1.1b Magnitude matrix for the first two blocks
If the signal for block 1 is simply synthesized with frequencies and magnitudes as shown
using additive synthesis and then similarly, a signal is synthesized for block 2, two
synthesized signals of 800 samples each are obtained for the two blocks. To obtain the
full re-synthesized signal, these blocks are simply pasted next to each other. This re-
synthesized version is very similar in characteristics to the original except that between
the 800th and the 801st sample there is a change in phase, creating a discontinuity which
results in a pop when it is played back. The last 100 samples of the 800 synthesized
samples of block 1 are shown in figure 6.1.2a.
Figure 6.1.2a The last 100 samples of the synthesized block 1
Figure 6.1.2b The first 100 samples of the synthesized block 2
The first 100 samples of the 800 synthesized samples of block 2 are shown in figure
6.1.2b. The samples in the vicinity of the 800th sample of the concatenated signal are
shown in figure 6.1.2c.
Figure 6.1.2c The samples around the 800th and 801st samples of the concatenated
synthesized blocks
As can be seen from figure 6.1.2c, there is a discontinuity at the point where the two
separately synthesized blocks are concatenated even for this simple case. In an actual
music signal, it is found that there is usually a slight change in a given frequency between
blocks. As an example, three consecutive blocks may contain the frequencies 500Hz, 505
Hz and 495Hz. This is to be expected as discussed in chapter 2.6. Also, the number of
frequencies in adjacent blocks may not be the same due to detection of new frequencies
as the characteristics of the music change. Under these circumstances there will be severe
discontinuities leading to severe distortion if this simple method is used.
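The pop is easy to reproduce. The following minimal Python sketch (names assumed) synthesizes the two example blocks independently and concatenates them:

```python
import numpy as np

fs = 32000
n = np.arange(800)                       # one 25 ms block = 800 samples
freqs = [1000.0, 1250.0, 1500.0]

# Each block is synthesized independently, so every sinusoid restarts
# at phase zero at the beginning of the block.
block = sum(np.sin(2 * np.pi * f * n / fs) for f in freqs)
signal = np.concatenate([block, block])

# 1250 Hz completes 31.25 cycles in 800 samples and 1500 Hz completes
# 37.5 cycles, so their phases do not line up at the block boundary:
print(signal[798:802])                   # visible jump around sample 800
```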
Another way of dealing with this situation is by cross-fading between adjacent blocks, as
mentioned in chapter 2.5.5. This is not a very elegant technique because it only
eliminates the discontinuities but does not join the two blocks in a natural way. Typically
two adjacent blocks will contain very similar, but slightly different, frequencies (for example,
500Hz and 505Hz). If the cross-fading technique is used to concatenate the two
synthesized blocks, there will be a period where both the 500Hz and the 505Hz signals are
present in almost equal proportion. This results in a frequency beating at 5Hz, which
is highly undesirable. If this phenomenon takes place for almost every frequency at each
frame, the result is a distorted output.
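The 5 Hz beat can be seen directly from the standard sum-to-product identity applied to the two tones in question:

sin(2·π·500·t) + sin(2·π·505·t) = 2·cos(2·π·2.5·t)·sin(2·π·502.5·t)

During the cross-fade the ear therefore hears a single tone near 502.5 Hz whose amplitude envelope, 2·cos(2·π·2.5·t), rises and falls five times per second.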
From the study quoted in chapter 2.5 (Jean-Claude Risset, 1966), it is known that what is
more likely happening is that the frequency of 500Hz originating from some instrument
is changing to 505Hz over the course of the block due to vibrato and other such effects. It
is therefore preferable to view two such similar frequencies in adjacent frames as being
related to the same source. A more desirable way of synthesizing the output is to
synthesize a frequency of 500Hz, which slowly increases over the course of the frame
and becomes 505Hz by the end of the frame. This may not necessarily be the way in
which it happened in the original signal. For example, the 500Hz frequency could have
been stable during the first half of the frame and then rapidly risen to 505Hz in the
second half of the frame in the actual signal. There is no way of knowing this though,
because the frame is analyzed as a whole and it is impossible to know about such changes
which take place within the frame itself. The best that can be done is to estimate what
happened within the frame and generate a tone whose frequency smoothly changes from
500Hz to 505Hz over the course of the frame. According to the study by Risset quoted in
chapter 2.5, it is also necessary to simulate these frequency variations in the synthesized
tone for overall quality. One way this could be done is by linearly increasing the
frequency from 500Hz to 505Hz over the course of the frame. This is known as linear
spectral interpolation. The problem with this is that over the course of many frames the
slope of the linear interpolation keeps changing, giving rise to discontinuities (corners) in
the curve of frequency versus time. Since frequency is the derivative of phase, this can
lead to discontinuities in the synthesized signal if the phase is not handled carefully.
Consider table 6.1.2 as an
example. This table shows the variations in frequency over four frames for a single tone
that needs to be generated. The variations considered are quite extreme, but this is only to
demonstrate the method. This table is an example of a simple time-frequency matrix.
Block 1 Block 2 Block 3 Block 4
Frequency 500Hz 600Hz 400Hz 500Hz
Table 6.1.2 A simple time-frequency matrix (magnitudes not shown)
If the synthesized tone is generated using linear frequency interpolation, the curve of
frequency versus time shown in figure 6.1.3 is used, and the signal is generated from it using:

phase = 2·π·f·t
s = sin(phase)     (6.1.1)

These equations hold at a given point in time t where the frequency is found to be f. Using
them, the output signal s can be synthesized. The output signal, which is synthesized from
a discontinuous frequency curve, is also discontinuous at the same points as the frequency
curve. The second discontinuity is more obvious and is indicated in figure 6.1.4.
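A sketch of this naive approach in Python follows; the frame length is assumed and the frame values follow table 6.1.2. The direct phase = 2·π·f(t)·t computation of equation 6.1.1 is exactly what causes the artifacts:

```python
import numpy as np

fs = 32000
frame_len = 800                              # assumed frame length in samples
frame_freqs = [500.0, 600.0, 400.0, 500.0]   # the four frames of table 6.1.2

t = np.arange(len(frame_freqs) * frame_len) / fs
knots = np.arange(len(frame_freqs)) * frame_len / fs
f_curve = np.interp(t, knots, frame_freqs)   # piecewise-linear frequency

# Equation 6.1.1 applied literally: phase = 2*pi*f(t)*t.
s = np.sin(2 * np.pi * f_curve * t)

# The instantaneous frequency of s is f(t) + t*f'(t), so every corner in
# the frequency curve makes the waveform lurch, as in figure 6.1.4.
```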
Figure 6.1.3 Frequency vs. time curve using linear frequency interpolation
Though the frequency variations shown are quite extreme, the principle remains the same
and linear frequency interpolation can cause discontinuities in the synthesized signal if
phase issues are ignored. This problem could be solved by considering a method in which
the starting and ending phases are taken into account for each frame. A cubic phase
interpolation algorithm is used to achieve just this. The basics of this algorithm are
discussed in the paper by McAulay & Quatieri, 1986. In order to use this algorithm
though, the time-frequency matrix must be analyzed and grouped into sets of frequencies
from frame to frame that are related to each other to form a frequency track. This is
explained further in the next section.
Figure 6.1.4 Synthesized signal with discontinuities
6.2 FRAME TO FRAME PEAK MATCHING
If the number of peaks detected were constant from frame to frame and all the peaks in
one frame were related to the peaks in the next frame (e.g., 500Hz in the first frame and
505Hz in the next frame, 810Hz in the first frame and 805 Hz in the next frame), there
would be no problem of matching parameters from one frame to the next. In reality there
is side-lobe interaction, which causes spurious peaks, and there are also vibrato effects as
well as the dynamic nature of the music itself, which all result in a time-varying spectrum.
Typically adjacent frames neither have the same number of peaks nor do they have all the
frequencies related to each other. At this point the matrix can be viewed as being one in
which columns of elements are grouped together since these are basically the frequencies
found in each frame. The aim is to find frequencies between frames that are related and to
arrange all these sets of related frequencies in rows, so that each of these rows is a
frequency track. As an example consider the time-frequency matrix in table 6.2.1. Each
column contains the frequencies identified in that frame in ascending order of frequency.
Frame 1 Frame 2 Frame 3 Frame 4
Frequency 1 500 505 405 405
Frequency 2 805 1040 504 502
Frequency 3 1050 1502 1040 1045
Frequency 4 1500 ---- 1505 1500
Table 6.2.1 An example time-frequency matrix (frequencies in Hz)
An algorithm must be developed, which sorts this table in an optimum way so that a more
useful table is obtained that has the frequencies arranged in tracks. In fact table 6.2.1
must be sorted so that it resembles table 6.2.2.
Frame 1 Frame 2 Frame 3 Frame 4
Track 1 0 0 405 405
Track 2 500 505 504 502
Track 3 805 0 0 0
Track 4 1050 1040 1040 1045
Track 5 1500 1502 1505 1500
Table 6.2.2 The matched matrix, with frequencies arranged in tracks (frequencies in Hz)
This table should be viewed not as a collection of columns that contain the frequencies
found in each frame but rather as a collection of rows each containing a frequency track.
Notice that there are now more rows than before, but each row or track can be
synthesized with a tone of frequency varying according to the frequencies indicated in
that row (and also the magnitudes indicated in the magnitude matrix not shown here).
The final synthesized output is the sum of the individual synthesized outputs for each
track. In the case shown in table 6.2.2, there are 5 synthesized tones for 5 tracks, which
are added up in the end to give the total synthesized output. This new matrix can be
referred to as the matched matrix because the frequencies have been matched to create
frequency tracks.
The reason why a frequency in a given track is selected to be in that track is because it
satisfies some criterion for proximity with some other frequency in the previous frame of
that track. For example, in the example shown in table 6.2.2, the frequency of 505Hz is
chosen to be in the second frame of the second track because it satisfies a criterion of
proximity (e.g., the two frequencies are within 10 Hz of each other) with the frequency of
500 Hz in the previous frame. Obviously, it is a better match for the second frame in that
track as compared to the frequency of 1040 Hz, which is matched to track 4 in the same
frame. As an example, one condition that could be used is that for a certain frequency to
be chosen to be in a certain track, it must be within ±10 Hz of the previous frequency in
that track. If this condition alone is used, and there is a frame in which more than one
frequency satisfies it for the second track, then this condition alone is insufficient to
perform matching. Also, if there is a frequency in the second track which satisfies the
condition for more than one frequency in the first track, then this condition alone is again
insufficient. A set of conditions must be used to find an
optimum way for matching frequency peaks from frame to frame. A three-step procedure
for doing this is outlined below.
It is assumed that for a certain frequency track, matching has been performed up to the
k-th frame and a match has to be found in the (k+1)-th frame. Frequencies are denoted in
the form w_y^x, where x is the frame number and y is the number of the frequency within
that frame. The frequency being considered is w_n^k, which is the n-th frequency in the
k-th frame. A match needs to be found for this frequency in the (k+1)-th frame. The
following is the three-step procedure:
Step 1
First of all, it is important to remember that when discussing the creation of the new
matched matrix containing frequency tracks, there are actually two matched matrices:
one containing frequencies and the other containing magnitudes. Assume that there are
a total of N frequency peaks in the k-th frame and M frequency peaks in the (k+1)-th frame.
The best frequency peak match in the (k+1)-th frame must be found for the n-th frequency
in the k-th frame, which is w_n^k. To begin with, some matching interval Δ is set up, which
is equivalent to some frequency spacing in Hz within which a potential match must lie
with respect to w_n^k to possibly be matched to it. All potential matches in the (k+1)-th
frame are found by using the following criterion:

|w_n^k − w_m^(k+1)| < Δ,   1 ≤ m ≤ M     (6.2.1)
This condition is checked for all the M frequencies in the (k+1)-th frame (except those that
are already matched to some other frequency in the k-th frame) and all the frequencies that
satisfy this criterion are stored. If no frequency satisfies this condition, then that
frequency track is considered dead and it ends at that frame. If this is the case, the
frequency is matched to itself in the next frame of the matched matrix containing
frequencies, but with zero magnitude in the matched matrix containing magnitudes. This
is required because during synthesis, when a track dies in the k-th frame with a frequency
of f Hz, it is synthesized with a linear fade-out effect starting from frame k to the
(k+1)-th frame. This is easily done by inserting exactly the same frequency in frame k+1
with a magnitude of zero, so that the algorithm automatically performs the linear fade-out
effect. If instead of inserting the frequency of f Hz in frame k+1, it is left at 0 Hz,
the algorithm not only fades out this last frequency but also sweeps it from f Hz to 0Hz,
which is undesirable. The remaining frames in the matched matrix containing magnitudes
are filled with zeros and step 2 can be skipped. If one or more frequencies are found that
are within this matching interval, and are therefore potential matches, one strategy would
be to simply take the frequency that is closest to w_n^k and choose it as the best matched
frequency. Although this frequency w_m^(k+1) is the closest match for w_n^k in the
(k+1)-th frame, there could be another frequency in the k-th frame other than w_n^k which
is a better match for w_m^(k+1). So for optimum frequency peak matching, frequencies are
checked for best matching in both the next and the previous frame. This is described in step 2.
Step 2
By the time this step is reached, one or more potential matches in the (k+1)-th frame have
already been found for the frequency w_n^k in the k-th frame. As discussed in step 1, some of
these potential matches in the (k+1)-th frame may have better matches in the k-th frame
than w_n^k. All potential matches w_m^(k+1) are discarded if they cannot satisfy the
criterion in equation 6.2.2:

|w_m^(k+1) − w_n^k| ≤ |w_m^(k+1) − w_i^k|   for n < i ≤ N     (6.2.2)
Here w_m^(k+1) is the potential match in the (k+1)-th frame. The index i must be greater
than n because all previous frequencies in the k-th frame are presumed to be already
matched and there is no need to check them. Only potential matches that satisfy equation
6.2.2 are stored and the rest are discarded. If there is only one frequency left, it is matched
to w_n^k. If there is more than one frequency, the one that is closest to w_n^k is matched
to it. If there are no frequencies left that can be matched to w_n^k, that track is considered
dead. This procedure is repeated for every frequency in the k-th frame so that each
frequency is either matched with some frequency in the (k+1)-th frame and the track
continues, or else, if no match is found, the track dies. It should be noted that many other
situations are possible and could be considered for better optimization, but to limit the
tracker alternatives, only the situations described above are considered.
Step 3
When all the frequencies in frame k have been tested for matches in the next frame and
have contributed to tracks which either continue or die, there may be some leftover
frequencies in the (k+1)-th frame that have not been matched to any frequency in frame k.
In this case, these frequencies are considered the first frequencies of a track which
begins in the (k+1)-th frame. In other words, a track is born in the (k+1)-th frame. In this
case a new frequency is created in frame k with zero magnitude. This is done for the same
reason described in step 1 with regard to a track that dies: when a track is
born, a similar fade-in effect is used during synthesis. An illustration of the birth/death
procedure is shown in figure 6.2.1. The result of applying the tracker to a segment of real
speech is shown in figure 6.2.2. This figure illustrates the ability of the tracker to adapt
quickly to voiced and unvoiced regions.
Figure 6.2.1 Illustration of the birth/death procedure
(Taken from McAulay & Quatieri, 1986)
Figure 6.2.2 Typical frequency tracks for real speech
(Taken from McAulay & Quatieri, 1986)
After this procedure is performed for all the elements in the time-frequency matrix, a
matched matrix is obtained. All that remains to be done is to synthesize each row (track)
of this matched matrix and add up all these synthesized tracks to produce the final re-
synthesized output.
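As a concrete illustration, a minimal Python sketch of the three-step matcher is given below. The function name, the tuple-based bookkeeping, and the greedy left-to-right organization are illustrative assumptions rather than the thesis implementation; the ±10 Hz example interval is taken as inclusive so that the 1050 Hz to 1040 Hz match of table 6.2.2 succeeds:

```python
def match_frames(freqs_k, freqs_k1, delta=10.0):
    """Match peaks of frame k to frame k+1 (steps 1-3 above).

    freqs_k, freqs_k1 -- ascending lists of peak frequencies in Hz.
    Returns a list of (f_k, f_k1, born, died) track segments.
    """
    taken = set()            # indices in frame k+1 already claimed
    segments = []
    for n, fn in enumerate(freqs_k):
        # step 1: candidates within the matching interval (inclusive)
        cand = [m for m, fm in enumerate(freqs_k1)
                if m not in taken and abs(fn - fm) <= delta]
        # step 2: keep a candidate only if no later frame-k peak is closer
        cand = [m for m in cand
                if all(abs(freqs_k1[m] - fn) <= abs(freqs_k1[m] - fi)
                       for fi in freqs_k[n + 1:])]
        if cand:
            m = min(cand, key=lambda m: abs(freqs_k1[m] - fn))
            taken.add(m)
            segments.append((fn, freqs_k1[m], False, False))
        else:
            # track dies: repeat fn in frame k+1 at zero magnitude
            segments.append((fn, fn, False, True))
    # step 3: unmatched frame-(k+1) peaks start new tracks (births)
    for m, fm in enumerate(freqs_k1):
        if m not in taken:
            segments.append((fm, fm, True, False))
    return segments

# frames 1 and 2 of table 6.2.1: 500 matches 505, 805 dies,
# 1050 matches 1040 and 1500 matches 1502, as in table 6.2.2
print(match_frames([500, 805, 1050, 1500], [505, 1040, 1502]))
```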
6.3 SYNTHESIS USING CUBIC PHASE INTERPOLATION
In this section, the actual synthesis method, which uses the matched frequency matrix
obtained from chapter 6.2 is described. Cubic phase interpolation is used for obtaining
the smooth spectral interpolation described in previous chapters. The reasons for using
this spectral interpolation algorithm were discussed in chapter 6.1. The method used for
synthesis is now derived.
In the previous section, only the frequencies and the magnitudes were stored in the form
of matrices. The phase associated with each frequency was discarded. In this section, it is
assumed for the sake of generality that the phase associated with each frequency is also
known and stored. It is easy to modify the algorithm for the case where phase is not taken
into account. The parameters associated with each frequency are the frequency in radians,
the magnitude, and the phase, denoted by w, A, and θ respectively. Assuming that
interpolation is to be performed for frequency l between the k-th and the (k+1)-th frames,
the parameters associated with these frequencies are (A_l^k, w_l^k, θ_l^k) and
(A_l^(k+1), w_l^(k+1), θ_l^(k+1)) respectively. A linearly interpolated envelope is used to
solve the magnitude interpolation problem. In other words, the magnitude is interpolated
from frame to frame using equation 6.3.1:

A(n) = A^k + ((A^(k+1) − A^k) / S) · n,   n = 0, 1, ..., S − 1     (6.3.1)
Here n is the time sample index into the k-th frame and S is the total number of samples per
frame. (The track subscript l has been omitted for convenience.) Unfortunately this
simple approach cannot be used to interpolate between frequencies, due to the
discontinuities which occur as described in chapter 6.1. Also, the measured phase is
modulo 2π and hence phase unwrapping must be performed to ensure that frequency
tracks are maximally smooth. Since a cubic equation is used for interpolating phase,
the first step is to propose a function for phase that is a cubic polynomial:
θ(t) = ζ + γ·t + α·t² + β·t³     (6.3.2)
It is convenient to treat the phase as a function of the continuous time variable t with
respect to some given frame. This phase function must vary smoothly with time so that
the output for that track which is given by equation 6.3.3 also varies with time as
smoothly as possible.
s_track(t) = cos[θ(t)]     (6.3.3)
It is important to note at this point that phase is directly related to instantaneous
frequency by the relation in equation 6.3.4.
f(t) = d[θ(t)] / dt     (6.3.4)

where f(t) denotes instantaneous frequency.
In other words, instantaneous frequency is the derivative of phase. Using this relation and
applying it to equation 6.3.2, the following is obtained:
w(t) = γ + 2·α·t + 3·β·t²     (6.3.5)
At the starting point of the frame, where t = 0:

θ(0) = ζ = θ^k
θ'(0) = w(0) = γ = w^k     (6.3.6)
Here θ denotes phase and θ' denotes the first derivative of phase. In other words, the
starting phase of a given frame is the phase that is associated with that frequency for that
frame, and it is equivalent to the variable ζ. The starting frequency of that frame, which is
actually the frequency detected in that frame, is equivalent to the parameter γ. These two
variables can therefore be set directly from the known information. The variables α and β
now need to be solved for. Consider the terminal point of the frame, where t = T:
θ(T) = θ^k + w^k·T + α·T² + β·T³ = θ^(k+1) + 2·π·M     (6.3.7)

θ'(T) = w(T) = w^k + 2·α·T + 3·β·T² = w^(k+1)     (6.3.8)
Equation 6.3.7 is based on the fact that at the end of the frame the phase should be equal
to the total unwrapped phase found for the next frame. The unwrapped phase is
equivalent to the measured phase θ^(k+1) added to some integer multiple of 2π, which is
directly related to how many cycles were completed in that frame for that track. Equation
6.3.8 is based on the fact that at the end of the frame, the instantaneous frequency should
be equal to the beginning frequency of the next frame. There are now two equations in α
and β, and the only variable which is still unknown is M, which basically stands for the
number of cycles completed by the frequency track in that frame. At this point M is
unknown, but the equations can be solved for any given value of M using:

[ α(M) ]   [  3/T²   −1/T  ] [ θ^(k+1) − θ^k − w^k·T + 2·π·M ]
[ β(M) ] = [ −2/T³   1/T²  ] [        w^(k+1) − w^k          ]     (6.3.9)
Any suitable value of M can be used to solve for all the required parameters, so some
constraint must be applied to select the value of M that suits the algorithm. At
present there is a family of curves that produce the required result in terms of starting
and ending frequencies and phases. Since interpolation between frames needs to be as
smooth as possible, the criterion of maximal smoothness is used to solve for M. It
seems obvious that the best phase function to choose would be the one that results in the
least variation in frequency. Therefore a reasonable criterion for smoothness of phase is
that equation 6.3.10 is minimized:

f(M) = ∫[0,T] (θ''(t; M))² dt     (6.3.10)
Here θ''(t; M) denotes the second derivative of the phase θ(t; M) with respect to the
time variable t and is an indicator of the rate of change of frequency. The right-hand side of
the equation finds the area under the curve of the square of the rate of change of
frequency for that frame, for a given value of M. When this area is minimized, the
variation in frequency over the whole frame is minimized. The value of M that minimizes
this equation is chosen. Although M is integer valued, since f(M) is quadratic in M, the
problem can easily be solved by minimizing f(x) with respect to the continuous variable
x and then rounding it off to the closest integer to get M. It can be shown that equation
6.3.10 is minimized by:

x = (1 / (2·π)) · [ (θ^k + w^k·T − θ^(k+1)) + (w^(k+1) − w^k)·T/2 ]     (6.3.11)
M is determined by rounding off x to the closest integer and substituting it into equation
6.3.9 to solve for α(M) and β(M), which are the α and β values for that particular
value of M. Having solved for α and β, and knowing ζ and γ as described above,
these coefficients are now used to construct the phase function for that frame:

θ(t) = θ^k + w^k·t + α(M)·t² + β(M)·t³     (6.3.12)

This phase function not only satisfies all the measured phase and frequency end-point
constraints but also creates a phase function that is maximally smooth.
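The computation of α, β, and M can be condensed into a short routine. The sketch below follows equations 6.3.9 and 6.3.11 directly; the function names are assumptions, and frequencies are taken in radians per sample so that T is simply the frame length in samples:

```python
import numpy as np

def cubic_phase_coeffs(theta_k, w_k, theta_k1, w_k1, T):
    """Solve equations 6.3.9 and 6.3.11 for the cubic phase polynomial
    theta(t) = theta_k + w_k*t + alpha*t**2 + beta*t**3 on 0 <= t <= T.
    Frequencies w are in radians per sample, T in samples.
    """
    # equation 6.3.11: the (continuous) M giving maximal smoothness
    x = ((theta_k + w_k * T - theta_k1) + (w_k1 - w_k) * T / 2) / (2 * np.pi)
    M = round(x)             # the rounding can be dropped if phase is ignored
    # equation 6.3.9: solve the 2x2 linear system for alpha and beta
    b1 = theta_k1 + 2 * np.pi * M - theta_k - w_k * T
    b2 = w_k1 - w_k
    alpha = 3 * b1 / T**2 - b2 / T
    beta = -2 * b1 / T**3 + b2 / T**2
    return alpha, beta

def frame_phase(theta_k, w_k, theta_k1, w_k1, T):
    """Equation 6.3.12 evaluated at every sample of the frame."""
    alpha, beta = cubic_phase_coeffs(theta_k, w_k, theta_k1, w_k1, T)
    t = np.arange(T)
    return theta_k + w_k * t + alpha * t**2 + beta * t**3
```

As a sanity check, a constant tone whose end phase equals its start phase plus w·T (modulo 2π) yields α = β = 0, i.e. the frequency stays exactly constant across the frame.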
In chapter 6.2, the insertion of a frequency w^k in the previous frame when a frequency
w^(k+1) is born in some frame, and its insertion in the next frame when it dies, was briefly
discussed. This is done by setting the frequency w^k = w^(k+1) and setting the amplitude
in that frame to zero (A^k = 0). The initial phase for this frame is calculated to ensure that
the phase constraints are satisfied at the end of this frame, i.e. at the beginning of the next
frame. Since only the magnitude has to rise while the frequency stays constant in this
initial frame, the value of the starting phase that forces this to happen is calculated.
Though the starting and ending frequencies of this frame are the same, the algorithm
could introduce slight changes in frequency over the course of the frame to satisfy the
phase constraints. The value of phase which introduces no change in frequency over the
course of the frame is given by:

θ^k = θ^(k+1) − w^(k+1)·S     (6.3.13)
where S is the number of samples per frame. Every frequency track (described in chapter
6.2) in the given matched matrix can be synthesized and the results added up to give the
final output using:

s(n) = Σ (l = 1 to L) A_l(n) · cos[θ_l(n)]     (6.3.14)

where A_l(n) is estimated from equation 6.3.1, θ_l(n) is calculated using equation 6.3.12,
and L is the number of frequency tracks in that matrix.
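Putting the magnitude and phase interpolation together, one track can be synthesized frame by frame as sketched below. This assumes cubic_phase_coeffs from the previous sketch is in scope; the choice of each frame's target end phase as the trapezoidal integral of the linearly interpolated frequency is an assumption that corresponds to the phase-free mode described below (with it, the cubic reduces to a linear frequency sweep with continuous phase):

```python
import numpy as np

def synthesize_track(freqs, mags, S):
    """Synthesize one frequency track from per-frame frequencies
    (radians/sample) and magnitudes, with S samples per frame.
    A track that is born or dies simply has zero magnitude in its
    first or last frame (see chapter 6.2)."""
    out = np.empty((len(freqs) - 1) * S)
    theta = 0.0                      # starting phase of the first frame
    n = np.arange(S)
    for k in range(len(freqs) - 1):
        # ending phase of this frame becomes the start of the next one
        theta_next = theta + (freqs[k] + freqs[k + 1]) / 2 * S
        alpha, beta = cubic_phase_coeffs(theta, freqs[k],
                                         theta_next, freqs[k + 1], S)
        phase = theta + freqs[k] * n + alpha * n**2 + beta * n**3
        amp = mags[k] + (mags[k + 1] - mags[k]) * n / S   # equation 6.3.1
        out[k * S:(k + 1) * S] = amp * np.cos(phase)      # equation 6.3.14
        theta = theta_next
    return out
```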
In this study the phase information is not important, and certain alterations to this
algorithm can be made. Since phase information is not used, the value of ζ used for the
synthesis of each frame is not constrained. ζ could be set to 0 for each frame,
thereby forcing the phase to be zero at the beginning of each frame, but this does not
offer any advantages as such. To better understand how the value of M can be chosen so
that it is useful in this study, consider the effect of using different values of M. Figure
6.3.1 shows the variation of phase over the course of a frame for different values of M.
It should be noted that all the curves shown in the figure satisfy the required conditions of
starting and ending phases as well as starting and ending frequencies. They all also result
in smooth curves without discontinuities at frame boundaries. The only difference
between these curves is the variation of frequency during the course of the frame.
Depending on the value of M, some of the curves have a large variation in frequency in
the course of the frame (e.g. the case where M = 0 in figure 6.3.1) and some have very
little variation (e.g. the case of M = 2 in figure 6.3.1). The figure shows the variation in
phase, but since instantaneous frequency is just a derivative of phase, an almost linear
phase curve indicates very little change in frequency over the course of the frame.
Figure 6.3.1 Variation of phase over one frame for different values of M using cubic
phase interpolation
(Taken from McAulay & Quatieri, 1986)
When x is calculated as shown in equation 6.3.11, a value is obtained which, if
substituted for M in equation 6.3.9, produces the values of α and β that give the
smoothest (most linear) curve in phase, which in turn translates to the smoothest curve in
frequency. This value could not be used previously because M was required to be an
integer so that the starting and ending phases required by the stored phase information
could be obtained. So, x was rounded off to the closest integer to get M. Now that there
are no phase concerns, the value x is used without round-off to get the smoothest
possible curve. The starting and ending phases need careful handling though. In the very
first frame, a starting phase of zero is assigned. Since the non-integer value x is used
instead of M, a non-zero ending phase is obtained for the first frame. The signal is
synthesized so that the starting phase of each frame is set to the ending phase of the
previous frame. This is done for every subsequent frame. In this manner, there are no
discontinuities and the smoothest possible phase curve as well as frequency curve over all
the frames is obtained. The final synthesis is done in the same manner as before, using
equations 6.3.12 and 6.3.14. A model for additive synthesis was shown in figure 2.5.3.
With this method, a slightly changed model is proposed in which the frequencies and
amplitudes for each block are not fixed. The amplitude envelope varies in a linear fashion
for each frame according to equation 6.3.1 and the frequencies vary smoothly using the
cubic phase interpolation algorithm. The output is the sum of all the tracks generated.
Also to be noted is the fact that each sub-band is synthesized separately using the model
shown above. The final output is the sum of all the re-synthesized sub-bands. If the
synthesized sub-bands are y1, y2, ..., y6, the total output is given by:

y(n) = y1(n) + y2(n) + y3(n) + y4(n) + y5(n) + y6(n)     (6.3.15)

where y(n) is the re-synthesized output signal.
Chapter 7 Results and Conclusions
The methods described in previous chapters were used to analyze and represent various
kinds of audio signals. To test the effectiveness of this representation, re-synthesis was
performed so that the original version could be compared to the synthesized version. The
algorithm allowed flexibility for setting up some of the parameters. The number of sub-
bands into which the signal was divided was fixed at six. The window lengths within a
sub-band were not variable, but each sub-band could be set-up to use any desired window
length (as long as it was of the form 10·2^n ms). Also, the maximum number of peaks
within a sub-band could be fixed according to the type of audio signal being analyzed
(found empirically) and the amount of compression required. Another parameter that
could be set was the number of iterations performed during adaptation and this could be
set to a different value for different sub-bands (depending on the frequency resolution of
the human ear). Different settings revealed different results for different types of audio
signals. Four audio signals were analyzed and re-synthesized. They were as follows:
a) A classical music piece
b) A guitar chord
c) A clarinet patch from a synthesizer
d) Speech
A significant assumption in the analysis/synthesis procedure was the fact that the phase
information found during analysis was not required during synthesis and was therefore
discarded. In order to validate this, listening tests were performed on seven subjects to
determine the effects of including and omitting phase. The results of these listening tests
are shown in chapter 7.5. Also, as mentioned in previous chapters, one of the important
improvements in this algorithm was that of using frequency resolution of the human ear
to reduce the number of iterations performed and thereby reducing computational
complexity. The frequency resolution of the human ear, though, was calculated using
graphs that showed the frequency resolution of the ear for single tones. In practice these
values could be different when the ear is presented with complex tones. Listening tests
were performed with varying numbers of iterations to verify that the values used were
good estimates. The results are shown in chapter 7.5.
The remaining parameters of block size and maximum number of frequency peaks
were determined empirically to produce the best results for the given data. The
parameters used for each audio signal as well as the results of analysis/re-synthesis are
now provided. Also, the details of the computational power that was saved as compared
to the Matt Kotvis algorithm are provided. The main reasons why computational
complexity was reduced were that matrix inversion was avoided as well as the fact that
the number of iterations performed while adapting to a peak was reduced. It should be
noted that each iteration involves sampling the DTFT of the signal three times. The
amount of computational power required for matrix inversion depends on how well the
process is optimized. The exact numbers of cycles saved are therefore not calculated but
the details of the processes saved are listed below.
7.1 CLASSICAL MUSIC
The piece of music chosen for this category was a violin concerto by Bruch (track 5). The
set of parameters that gave the best results is shown in table 7.1.1.
Band            Block Size (ms)   Block Size (samples)   Iterations   Max. Frequency Peaks
0 - 500 Hz      80                80                     5            6
500 - 1000 Hz   80                80                     4            6
1 - 2 kHz       40                80                     3            8
2 - 4 kHz       40                160                    2            12
4 - 8 kHz       40                320                    2            14
8 - 16 kHz      20                320                    3            -
Table 7.1.1 Parameters used for classical music
These were the values selected at the beginning of the algorithm. The compression ratio
can be calculated by comparing the number of samples stored in the time-frequency
matrix for one second with the number of samples in time stored for one second in the
original signal. Since the sampling rate is 32 kHz, the number of samples per second in
the original signal is 32000. In a given sub-band of the time-frequency matrix:
Number of frequencies per second < (max. frequency peaks) × (blocks per second)     (7.1.1)

The "<" operator is used because the psychoacoustic model usually discards some data,
but it is not possible to know exactly how much beforehand. The total number of
frequencies stored per second is calculated as the sum, over the sub-bands, of the maximum
number of frequencies stored per second:

Total frequencies/sec = (6 × 12.5) + (6 × 12.5) + (8 × 25) + (12 × 25) + (14 × 50) = 1350

There is also a separate magnitude matrix containing the magnitudes of these frequencies,
which contains as many magnitudes as frequencies. So the total number of samples
required to be stored per second is double the above number, which is 2700.

Figure 7.1.1 (a) The original music signal and (b) the synthesized signal
Compression ratio = 32000/2700

Compression ratio > 11:1

A compression ratio of greater than 11:1 can therefore be achieved with real recorded
music. The comparison between the original signal and the synthesized signal is shown
in figures 7.1.1a and b.
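For reference, the bound can be reproduced mechanically from the figures above with a few lines (a sketch; note that the thesis total credits the 14-peak band with 50 blocks per second):

```python
# peaks per block and blocks per second for the five analyzed sub-bands
contributions = [(6, 12.5), (6, 12.5), (8, 25), (12, 25), (14, 50)]
freqs_per_sec = sum(p * b for p, b in contributions)   # 1350
samples_per_sec = 2 * freqs_per_sec                    # plus magnitudes
print(32000 / samples_per_sec)                         # about 11.9
```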
Computational Power Saved
Matrix inversions saved per second: 50 80x80 matrices, 25 160x160 matrices, and 25
320x320 matrices
Iterations saved per second: 6575
7.2 GUITAR CHORD
A clean electric guitar playing a chord was recorded via microphone. This piece was
analyzed and re-synthesized. The parameters used are shown in table 7.2.1:

Band            Block Size (ms)   Block Size (samples)   Iterations   Max. Frequency Peaks
0 - 500 Hz      40                40                     5            5
500 - 1000 Hz   40                40                     4            6
1 - 2 kHz       40                80                     3            8
2 - 4 kHz       40                160                    2            12
4 - 8 kHz       40                320                    2            14
8 - 16 kHz      20                320                    3            -
Table 7.2.1 Parameters used for the guitar chord
Compression ratio > 32000 / [(5+6+8+12+14) × 2 × 25] = 32000/2250

Compression ratio > 14:1
The results for analysis and re-synthesis of the guitar chord are shown in figures 7.2.1a
and b.
Figure 7.2.1 (a) Original signal and (b) Synthesized signal
Computational Power Saved
Matrix inversions saved per second: 50 40x40 matrices, 25 80x80 matrices, 25 160x160
matrices, and 25 320x320 matrices
Iterations saved per second: 8125
7.3 CLARINET PATCH FROM SYNTHESIZER
A clarinet patch was selected from a Korg X1D synthesizer. It was a single note played
for approximately 1 second and the parameters used are shown in table 7.3.1:
Band            Block Size (ms)   Block Size (samples)   Iterations   Max. Frequency Peaks
0 - 500 Hz      40                40                     5            3
500 - 1000 Hz   40                40                     4            4
1 - 2 kHz       40                80                     3            7
2 - 4 kHz       40                160                    2            9
4 - 8 kHz       40                320                    2            9
8 - 16 kHz      20                320                    3            -
Table 7.3.1 Parameters used for the clarinet patch
Compression ratio > 32000 / [(3+4+7+9+9) × 2 × 25] = 32000/1600

Compression ratio > 20:1
The results for analysis and re-synthesis of the clarinet patch are shown in figures 7.3.1a
and b.
Figure 7.3.1 (a) Clarinet original signal and (b) the synthesized signal
Computational Power Saved
Matrix inversions saved per second: 50 40x40 matrices, 25 80x80 matrices, 25 160x160
matrices, and 25 320x320 matrices
Iterations saved per second: 5600
7.4 SPEECH
For the case of speech, three different sets of compression levels were achieved by
varying the maximum number of frequency peaks allowed in each case. The speech
sample used was the sentence "Check this out" spoken by the author. Table 7.4.1 shows
the parameters used for speech.
Band            Block Size (ms)   Block Size (samples)   Iterations   Max. Peaks (Set 1)   (Set 2)   (Set 3)
0 - 500 Hz      20                20                     5            4                    4         3
500 - 1000 Hz   20                20                     4            8                    8         5
1 - 2 kHz       40                80                     3            12                   12        6
2 - 4 kHz       40                160                    2            10                   10        6
4 - 8 kHz       40                320                    2            14                   -         -
8 - 16 kHz      20                320                    3            -                    -         -
Table 7.4.1 Parameters used for speech
(The two lowest sub-bands use 20 ms blocks and therefore contribute 50 blocks per
second; the remaining analyzed bands use 40 ms blocks and contribute 25 blocks per
second.)

Set 1

Compression ratio > 32000 / [2 × (4×50 + 8×50 + 12×25 + 10×25 + 14×25)] = 32000/3000

Compression ratio > 10:1

Set 2

Compression ratio > 32000 / [2 × (4×50 + 8×50 + 12×25 + 10×25)] = 32000/2300

Compression ratio ≈ 14:1

Set 3

Compression ratio > 32000 / [2 × (3×50 + 5×50 + 6×25 + 6×25)] = 32000/1400

Compression ratio ≈ 23:1
On listening to both the original and the synthesized versions, it was found that the
quality of the synthesized version was quite good except for a hollow, reverberant
quality. The more compression that was required, the more prominent this reverberant
quality became, up to a point where the speech sounded very unnatural but remained
quite intelligible. The results for the first set of parameters (using the highest number of
frequency peaks) are shown in figures 7.4.1a and b.
Figure 7.4.1 (a) Original speech signal and (b) the synthesized version (set 1)
Computational Power Saved (Set 1)
Matrix inversions saved per second: 100 20x20 matrices, 25 80x80 matrices, 25
160x160 matrices, and 25 320x320 matrices
Iterations saved per second: 10300
7.5 LISTENING TESTS
Listening tests were performed to test the validity of two assumptions made in this study.
The first assumption was that the frequency resolution curve that was used to calculate
the number of iterations held good for complex audio signals. The second assumption
was that phase information was not important perceptually for the sinusoidal model used
and was therefore discarded.
Frequency Resolution:
To test the validity of the first assumption, each signal was synthesized 5 times using 5
different numbers of iterations. The number of iterations performed originally
for each signal is shown in tables 7.1.1, 7.2.1, 7.3.1, and 7.4.1. Each of the four audio
signals in the above chapters was re-synthesized with an additional +2, +1, 0,
-1, and -2 iterations for each band shown in the tables. The objective was to test whether
subjects were able to identify a difference in quality if more or fewer iterations were
performed than the number calculated from the frequency resolution curve (chapter 5.3).
ABX tests were conducted where one of the signals was always the one with the highest
number of iterations (+2 iterations) and the other was the same signal synthesized with a
lower number of iterations (given in the column marked "Iterations"). The results are
shown in the following tables. The X's mark correct answers by that particular subject.
CHORD
Iterations Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Subject 6 Subject 7
+1 - - - - - - -
+0 - - - - - - -
-1 X - X - - - X
-2 - - X X - - X
CLARINET
Iterations Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Subject 6 Subject 7
+1 - - X - X - -
+0 - - - - - - -
-1 - - X - X X -
-2 - - - - X - -
CLASSICAL
Iterations Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Subject 6 Subject 7
+1 - - - - - - -
+0 - - - - - - -
-1 - - - - - - -
-2 - - - - X - -
SPEECH
Iterations Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Subject 6 Subject 7
+1 - - - - - - -
+0 - - - - - - -
-1 - - - - - - -
-2 - - - - - - -
Table 7.5.1 Results of listening tests for frequency resolution
From table 7.5.1, the total number of correct responses for all four audio signals are now
shown for different numbers of iterations in comparison to the case of +2 iterations in
table 7.5.2. Responses are shown as a ratio of the number of correct responses to the
highest possible number of correct responses.
Number of iterations Number of correct responses
+1 2/35
+0 0/35
-1 6/35
-2 5/35
Table 7.5.2 Summed up results of listening tests for frequency resolution
It can be seen from the table that for lower numbers of iterations, a higher number of
subjects tended to answer correctly. For the case where the number of iterations was as
calculated from the graph (+0), there were no correct answers. This suggests that for the
control group used, the number of iterations used originally was sufficient.
Phase:
To test the effects of including phase and using it to re-synthesize the signal, listening
tests were performed. The same seven subjects were presented with the original signal, a
re-synthesized version without using phase and a re-synthesized version, using phase.
This was done for all four signals and subjects were asked to compare the two re-
synthesized versions and state which one they thought was better (or closer in quality to
the original). There was also an option to mark neither if they could not tell the
difference. The results are now presented in table 7.5.3. In each case the total number of
subjects who marked that choice are shown.
Signal      Version without phase preferred   Version with phase preferred   Neither preferred
Chord       4/7                               2/7                            1/7
Clarinet    7/7                               0/7                            0/7
Classical   3/7                               1/7                            3/7
Speech      6/7                               0/7                            1/7
Total       20/28                             3/28                           5/28
Table 7.5.3 Results of listening tests for phase
From the table, it is seen that a majority of the control group preferred the re-synthesized
versions without phase. The reason is that the re-synthesized versions that used phase
tended to be slightly off-pitch, as follows. All sub-bands were processed using short time
frames. For lower frequencies (e.g. 100 Hz), the number of cycles completed in such a
short time frame (e.g. 40 ms) is very small (e.g. 4 cycles). If phase information was used
during re-synthesis (e.g. 0 radians in the first frame to π/2 radians in the second frame),
then the number of cycles required to be completed in that
radians in the second frame), then the number of cycles required to be completed in that
time frame would be slightly different (e.g. 4.5 cycles instead of 4 cycles) in order to
maintain the phase relationships. But for lower frequencies and short windows, this
results in a slight pitch shift because it effectively changes the number of cycles per
second (frequency) being generated. Since the higher frequencies are not affected much
by this problem, they stay intact. This results in an overall off-tune effect. In general, the
results of this listening test suggest that it may be best to ignore phase in the particular
method of analysis/re-synthesis used in this study.
7.6 CONCLUSIONS
The sinusoidal model for representing audio signals used in this study is found to be
useful for representing certain types of audio signals. From the results, it is found that for
audio signals that are not very complex such as the guitar chord, the synthesizer patch,
and to a certain extent the speech case, it seems to perform quite well. In fact the results
for the synthesizer patch were excellent. On the other hand, for audio signals that were
quite complex in terms of having a large number of frequencies, the algorithm did not
perform as well as expected. It is concluded that the reason for this is the problem of
nearby frequencies interfering with each other and creating shifts in the frequency peaks
and also resulting in spurious peaks. The reason in turn for this is the fact that the signal
is being analyzed in short segments or windows, which result in the actual spectrum
being convolved with the window spectrum. In this case, the window is a rectangular
window, which seems to give the best results because though there are high side-lobes,
the main-lobe width is small. The more the numbers of frequencies present in a signal,
and particularly, the closer they are spaced, the worse the performance of the algorithm
is. This is due to the reasons stated above.
Though the procedure of dividing the signal into sub-bands has its own merits, there is
one problem that occurs due to it. When dividing the signal into sub-bands, the first step
is to use the filter bank, which consists of a bank of complementary high pass and low
pass filters combined with down-sampling operators as shown in figure 5.3.1.
Effectively, the signal is divided into bands by high pass filtering as well as low pass
filtering at frequencies of 8 kHz, 4kHz, 2kHz, 1kHz and 500 Hz. The filter used for these
operations is a high order elliptic filter. The filter is designed to have very high
attenuation at these frequencies in order to avoid aliasing problems. Subsequently, if
these frequencies are present in the given audio signal, they are highly attenuated. As an
example, when a guitar chord which has its fundamental note at 500 Hz is analyzed and
re-synthesized, the results are very poor. The problem is especially bad because the
frequencies are correlated in a musical sense. Since they are all an octave above each
other, they correspond to different octave levels of the same note. As a result, if this note
is present in the music, it is reproduced very poorly.
7.7 IMPROVEMENTS
The method can be used with more success if careful research is done on the window (or
frame) lengths required for different types of signals. In general, a window length, which
is as small as possible without damaging the spectral characteristics of the windowed
signal, is desirable. If some optimization procedure is used to adapt to a signal and
determine an optimum window length for its various sub-bands, the algorithm should
perform better.
Also, variable window length within a sub-band depending on the characteristics of the
signal during that period of time would result in an improvement in performance. In this
method, when the signal is fairly stationary, long windows would be used and when it is
undergoing rapid changes, the algorithm would shift to shorter windows.
Another factor worth researching is the type of window to be applied on each frame. In
this study, the rectangular window was used because in comparison to other conventional
windows such as the Hamming window and the Hann window, it gave better results. It is
possible though that depending on the nature of the signal, different types of windows
would give different results. A method could be developed which adapts to the nature of
the signal and uses the most suitable window.
One problem related to compression issues and also processing time is the fact that in
higher octaves, more frequencies are adapted to. This is because the actual range of
frequencies in higher octaves is broader and more frequencies are required to make the
synthesized version sound more natural. This contradicts the notion that most of the
information is contained in lower and mid octaves up to around 4 kHz and therefore these
should be analyzed with the highest number of frequency peaks. In the higher octaves, at
least for the case of music, the information seems to be more like shaped white noise (e.g.
cymbals). If the higher octaves could be modeled using this shaped white noise model
instead of the pure sinusoidal model used in the lower octaves, better results and more
compression could be obtained due to reduced frequency peaks (used only for shaping
the noise).
7.8 SUMMARY
The aim of the study was to develop a method to decompose audio signals in the form of
their individual frequency components and show their variation in time. The application
explored was audio compression. The main problem in the developed method was the
fact that a finite length window in time always creates spectral spreading. Among all the
parameters mentioned in the various tables shown in this chapter, block length (window
length) was probably the most crucial factor. If it was too large, the reconstructed signal
tended to sound smeared in time, with no transients. If it was too small, the frequencies
used to reconstruct the piece were off because of increased spectral spreading.
The main advantage offered by this algorithm (in addition to the greatly reduced
computations) is the fact that window lengths are adjustable, which was not the case in
earlier work. Other parameters such as maximum number of frequency peaks, number of
iterations performed, type of window used and the actual psychoacoustic model used can
also be changed as required. The algorithm is set up perfectly for further research into
finding optimum values for these parameters. Once these parameter values are optimized
and a balance between the spectral spreading problem and the time resolution problem is
achieved, the actual time-frequency representation will approach the ideal case. It will
then be possible to alter the characteristics of the signal in an incredible number of ways
by simply altering any component in the desired way. Applications such as pitch shifting,
time compression and expansion, and even pitch correction could be easily implemented.
In conclusion, the next step towards improving this method is to automate the process of
finding optimum values for the above described parameters. Once the time-frequency
representation is as good as possible, any related application becomes trivial.
APPENDIX A
The transfer functions used for the low pass and high pass filters are now given.
Low-Pass Filter

H(z) = (0.0077 + 0.0408·z^-1 + 0.103·z^-2 + 0.1581·z^-3 + 0.1581·z^-4 + 0.103·z^-5 + 0.0408·z^-6 + 0.0077·z^-7) /
       (1 − 1.8692·z^-1 + 3.1311·z^-2 − 3.2124·z^-3 + 2.5859·z^-4 − 1.4616·z^-5 + 0.5652·z^-6 − 0.1200·z^-7)

High-Pass Filter

H(z) = (0.0077 − 0.0408·z^-1 + 0.103·z^-2 − 0.1581·z^-3 + 0.1581·z^-4 − 0.103·z^-5 + 0.0408·z^-6 − 0.0077·z^-7) /
       (1 + 1.8692·z^-1 + 3.1311·z^-2 + 3.2124·z^-3 + 2.5859·z^-4 + 1.4616·z^-5 + 0.5652·z^-6 + 0.1200·z^-7)
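With these coefficients, one stage of the analysis tree (complementary filtering followed by decimation, as in figure 5.3.1) can be sketched in Python with scipy; the function name split_band is illustrative, and the sign pattern follows the reconstruction above:

```python
from scipy.signal import lfilter

# coefficients of the 7th-order elliptic filter pair given above
b_lp = [0.0077, 0.0408, 0.103, 0.1581, 0.1581, 0.103, 0.0408, 0.0077]
a_lp = [1, -1.8692, 3.1311, -3.2124, 2.5859, -1.4616, 0.5652, -0.1200]
b_hp = [0.0077, -0.0408, 0.103, -0.1581, 0.1581, -0.103, 0.0408, -0.0077]
a_hp = [1, 1.8692, 3.1311, 3.2124, 2.5859, 1.4616, 0.5652, 0.1200]

def split_band(x):
    """One stage of the analysis tree: complementary low/high split
    followed by decimation by two (see figure 5.3.1)."""
    low = lfilter(b_lp, a_lp, x)[::2]    # keep every other sample
    high = lfilter(b_hp, a_hp, x)[::2]   # high band ends up mirrored
    return low, high
```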
References:
1) Cohen, L. (1995), Time-Frequency Analysis, Prentice Hall PTR, New Jersey
2) Lindquist, C.S. (1989), Adaptive and Digital Signal Processing, Steward & Sons, Miami
3) Proakis, J.G. & Manolakis, D. (1988), Introduction to Digital Signal Processing, Macmillan Publishing Company, New York
4) Rabiner, L.R. & Schafer, R.W. (1978), Digital Processing of Speech Signals, Prentice Hall Inc., Englewood Cliffs, New Jersey
5) Vaidyanathan, P.P. (1993), Multirate Systems and Filter Banks, Prentice Hall Inc., Englewood Cliffs, New Jersey
6) Pohlmann, K.C. (1995), Principles of Digital Audio, McGraw-Hill Inc., New York
7) Dodge, C. & Jerse, T.A. (1997), Computer Music, Simon & Schuster Macmillan, New York
8) Roederer, J.G. (1995), The Physics and Psychophysics of Music, Springer-Verlag, New York
9) Tobias, J.V. (1970), Foundations of Modern Auditory Theory, Academic Press Inc., New York
10) Kotvis, M. (1997), An Adaptive Time-Frequency Distribution with Applications for Audio Signal Separation, Master's Thesis, Dept. of Music Engineering, University of Miami
11) Dologlou, I., Bakamidis, S. & Carayannis, G., Signal Decomposition in Terms of Non-Orthogonal Sinusoidal Bases, Signal Processing, 51 (1996), pp. 79-91
12) Jeong, H. & Ih, J.-G., Implementation of a New Algorithm Using the STFT with Variable Frequency Resolution for the Time-Frequency Model, J. Audio Eng. Soc., 47, No. 4 (April 1999), pp. 240-250
13) Portnoff, M.R., Time-Frequency Representation of Digital Signals and Systems Based on Short-Time Fourier Analysis, IEEE Transactions on Acoustics, Speech and Signal Processing, 28, No. 1 (Feb 1980), pp. 55-69
14) Vaidyanathan, P.P., Multirate Digital Filters, Filter Banks, Polyphase Networks, and Applications: A Tutorial, Proceedings of the IEEE, 78, No. 1 (Jan 1990), pp. 56-93
15) McAulay, R.J. & Quatieri, T.F., Magnitude-Only Reconstruction Using a Sinusoidal Speech Model, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 34 (1984), pp. 27.6.1-27.6.2
16) McAulay, R.J. & Quatieri, T.F., Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech and Signal Processing, 34 (1986), pp. 744-754
17) Serra, M.H., Rubine, D. & Dannenberg, R., Analysis and Synthesis of Tones by Spectral Interpolation, J. Audio Eng. Soc., 38, No. 3 (March 1990), pp. 111-128