Proceedings of the International Computer Music Conference 2011, University of Huddersfield, UK, 31 July - 5 August 2011
In this paper, we propose an analysis-based synthesis algorithm, which parameterizes pre-recorded clapping sounds in the form of atoms and synthesis patterns, and generates a sequence of claps from these parameters on the fly. During the analysis process, the recorded sequences of claps are initially segmented into individual claps, and then the stationary wavelet transform (SWT) is used to decompose them into sound grains. These sound grains form the basis of the initial dictionary, which is further optimized to produce a compact and adaptive version of the dictionary. The segmented claps are projected onto the adaptive dictionary, which generates the synthesis patterns for them. During the synthesis of clapping sounds, these patterns are tuned according to the reported synthesis parameters, either from physical interaction or via the graphical user interface (GUI). This technique generates realistic and expressive clap sounds for interactive and multimedia applications.

A detailed introduction to dictionary-based signal analysis and representation methods is presented in Section 2. In Section 3, the different blocks of the proposed analysis-synthesis model are explained in detail. The need for an expressive synthesis model and its importance in the natural sound synthesis context are discussed in Section 4. In Section 5, the achievements are summarized and future directions of the research are highlighted.

2. DICTIONARY-BASED SIGNAL REPRESENTATION TECHNIQUES

Conventional signal analysis and representation techniques, such as the Fourier transform (FT) and the short-time Fourier transform (STFT), use the Fourier basis to represent the signal as a superposition of fixed basis functions, i.e. sinusoids. The Fourier basis functions provide a useful representation when considering stationary signals, but most real-world everyday signals are nonstationary and transient. These analysis and representation techniques are therefore inadequate for such signals because of their poor localization in both time and frequency.

Over the last few years, researchers have been investigating new techniques which can represent signals in a compact form and are specialized to the signal under consideration. As a result, a number of basis functions and representation techniques [19, 10, 2] have been developed, so that any input signal can be represented in a way that is more compact, efficient and meaningful. One such technique, which has gained considerable recognition in recent years, is the dictionary-based method, as it offers a compact representation of the signal and is highly adaptive. Dictionary-based methods have been used in many signal processing applications, including the analysis and representation of audio signals [4, 20] and music [7].

In dictionary-based methods, an input signal is represented as a linear combination of atoms. These atoms are prototype discrete-time signals, and a collection of K such signals is referred to as a dictionary. Let s be a discrete-time real signal of length n, i.e. s ∈ ℜ^n, and let D = [φ1, φ2, ..., φK] be a dictionary, where each column φ represents an atom of length n, i.e. D ∈ ℜ^{n×K}. Using dictionary-based methods, the aim is to represent s as a weighted sum of atoms, which can be written as

s = ∑_{k=1}^{K} φk uk  (1)

where u is a column vector in ℜ^K representing the expansion coefficients or weights. Generally, an overcomplete dictionary D (n < K) is used, which means the matrix D has rank n and the weight vector u in Eq. (1) does not have a unique solution. Therefore, additional constraints need to be introduced to determine a unique or particular decomposition.

The representation of a signal given in Eq. (1) is usually approximated instead of solved exactly. An adequate and commonly used approximation of Eq. (1) is one where the signal s is represented by selecting only j atoms from the dictionary D, those corresponding to the highest weights uk. Such a representation reflects the fact that the signal energy is concentrated in the few atoms that have the highest weights. The approximation of the signal s can therefore be written as

s = ∑_{k=1}^{j} φk uk + r = Du + r  (2)

where j is the number of selected atoms (j < K), and r ∈ ℜ^n is the residual or approximation error. The selection of atoms and their number are controlled by limiting the approximation error. By applying this criterion, the approximation in Eq. (2) can be redefined as

s ≈ Du  such that  ‖s − Du‖2 ≤ ε  (3)

where ε is a given small positive number. An approximate solution with fewer atoms and corresponding weights is certainly an appealing representation. The sparsity, or compactness, of an approximation of a signal s is measured using the ℓ0 criterion, which counts the number of non-zero entries of the weight vector u ∈ ℜ^K. Finding the optimally sparse representation consists in solving

min_u ‖u‖0  such that  ‖s − Du‖2 ≤ ε  (4)

where ‖u‖0 is the ℓ0-norm, which counts the number of non-zero coefficients in the weight vector u. The problem of finding the optimally sparse representation, i.e. with minimum ‖u‖0, is in general a combinatorial optimization problem. Constraining the solution u to have the minimum number of non-zero elements makes the problem NP-hard [13], so it cannot be solved easily. Therefore, approximation algorithms, such as basis pursuit (BP) [2], matching pursuit (MP) [10], and orthogonal matching pursuit (OMP) [14], are employed to compute an approximate solution of Eq. (4). The MP and OMP algorithms are classified as greedy methods, because they approximate the signal iteratively, where at each iteration an
atom is selected from the defined dictionary which maximally reduces the residual signal. These algorithms converge rapidly and exhibit good approximation properties for the given criterion [10].

The sparse approximation of Eq. (4) can also be improved by using a specialized dictionary, which is trained from the signals under consideration. Instead of using predetermined dictionaries, dictionary learning methods [1, 7] can be used to refine them. This topic is addressed in detail in Section 3.4.

3. ANALYSIS-SYNTHESIS ALGORITHM

The dictionary-based synthesis algorithm presented here generates clapping sounds using a parametric representation modeled from the recorded sounds. Fig. 1 depicts the building blocks of the proposed analysis-synthesis scheme. The algorithm takes recorded continuous clapping sounds and splits them into sound grains. During the parameterization phase, the clapping sounds are represented by synthesis patterns and an adaptive dictionary, which is trained from these sound grains. The target clap is generated at the synthesis stage, where a pattern is selected and adjusted according to the reported parameters. In the following sections, each part of the algorithm is discussed in detail.

3.1. The Database

The proposed synthesis scheme models hand clapping sounds through the analysis of recorded sounds. Therefore, a set of hand clapping sounds was recorded in an acoustical booth (T60 < 100 ms) at a sampling rate of 44.1 kHz. The recording was made with two male and one female subject, all aged between 20 and 35. Each subject was seated in the acoustical booth alone, with a microphone placed about 100 cm away from their hands. Each subject was asked to clap at their most comfortable or natural rate (i.e. clapping mode) using his/her conventional hand configuration (i.e. clapping style). Then the subject was asked to clap at very enthusiastic and very bored rates using the same clapping style. A sequence of 20 claps was recorded in each clapping mode from each subject.

3.2. Segmentation and Peak Alignment

The first step in the analysis of the recorded hand clapping sound is to segment each sound signal into individual sound events, i.e. single claps, each represented by si. Each clap from a sequence of 20 is isolated by detecting and labeling its onset and offset points. The onset of each clap is labeled using the energy distribution method proposed by Masri et al. [11]. This method detects the beginning of an impulsive event, such as a clap, by observing the suddenness and the increase in energy of the attack transient. The short-time energy of the signal is used to locate the offset of each clap. Starting from the onset of each event, the short-time energy is calculated with overlapped frames and compared against a constant threshold to determine the offset.

There is no constraint on the number of claps or events taken from each clapping mode. An equal or different number of events can be selected from each clapper and their clapping modes. For simplicity, an equal number of claps, i.e. 20, was taken from each clapping mode of a clapper. Once the claps are selected and segmented, they are peak-aligned by means of cross-correlation such that the highest peaks occur at the same point in time. This increases the similarities between the extracted sound grains and improves the dictionary learning process. The set of collected sound events is put into matrix form as

S = [s1, s2, ..., sm]  (5)

where each column represents a clap (sound event) and the length of each clap is n. Zero padding is used for any segmented clap whose length is less than n. For this experiment, n = 2048 and m = 180.

3.3. Sound Grains

The main idea behind the proposed analysis scheme is that we want to represent the recorded clapping sounds in a way that i) the similarities and differences between
Figure 2. (a) Decomposition tree of SWT, (b) SWT filters, (c) construction of a sound grain.

clapping sounds recorded from different subjects can be observed and parameterized, and ii) this parametric representation can be manipulated in various ways to generate sound effects at the synthesis stage. Clapping sounds belong to the family of transient, non-stationary signals. Based on the frequency resolution properties of the human auditory system, such signals can be split into layers of grains, where the energy of each grain is present at a particular frequency or scale. The information in each grain and the overall structure of these grains are analyzed and represented based on the human auditory system. Such a parametric representation can be used to compare the characteristics of different sounds [21]. Furthermore, during the synthesis process, the parameters representing these grains can be manipulated in various ways to control the generated sound.

The discrete wavelet transform (DWT) has gained widespread recognition and popularity due to its ability to underline and represent the time-varying spectral properties of many transient and other nonstationary signals, and it offers localization in both time and frequency. The stationary wavelet transform (SWT) [12, 16] is a real-valued extension of the standard DWT, intended to solve the shift-invariance problem of the DWT. Therefore, in the proposed analysis scheme, the SWT is used to extract the sound grains from the clapping sound.

The SWT is applied to each clap si, which decomposes it into two sets of wavelet coefficient vectors: the approximation coefficients ca1 and the detail coefficients cd1, where the subscript represents the level of decomposition. The approximation coefficient vector ca1 is further split into two parts, ca2 and cd2, using the scheme shown in Fig. 2(a). This decomposition process continues up to the Lth level, which produces the following set of coefficient vectors: [cd1, cd2, ..., cdL, caL]. The approximation coefficients represent the low-frequency components, whereas the detail coefficients represent the high-frequency components. To construct the sound grains from the coefficient vectors, the inverse SWT is applied to each coefficient vector individually, with all others set to zero, which produces the following bandlimited sound grains: [λ1, λ2, ..., λL+1]. The block diagram of the process of acquiring the sound grains from the coefficient vectors is shown in Fig. 2(c). Each grain contains unique information from the sound event, and its length is the same as that of the sound event. The entire clap matrix S is split into sound grains, which produces the grain matrix Λ = [λi : i = 1, 2, ..., p], where the λi form the columns of the grain matrix and the total number of grains is p = m × (L + 1).

The selection of the wavelet type from the family of wavelets (i.e. Haar, Daubechies, etc.) and the decomposition level depend on the input sound signal, the application area, and the representation model. This is an iterative process in which the best wavelet type and optimum decomposition level are obtained by evaluating the perceived quality of the synthesized sounds generated from the different wavelet types and decomposition levels. Based on listening tests, the db4 wavelet with a 5th-level decomposition (L = 5) is used in the presented work.

3.4. Dictionary and Synthesis Patterns

The proper parameterization of the sound features extracted from the analysis part is an essential element of the synthesis system. In the proposed scheme, a dictionary-based approach is used to create a parametric representation of the recorded sounds. The similarities and differences of the sound grains, as well as their relationships to the input sounds, are preserved and reflected in the presented parametric representation. One key advantage of dictionary-based signal representation methods is the adaptivity of the composing atoms. This gives the user the ability to make a decomposition suited to specific structures in a signal. Therefore, one can select a dictionary either from a prespecified set of basis functions, such as wavelets, wavelet packets, Gabor, cosine packets, chirplets, warplets, etc., or design one by adapting its content to fit a given set of signals, such as a dictionary of instrument-specific harmonic atoms [7].

Choosing a prespecified basis matrix is appealing because of its simplicity, but there is no guarantee that these bases will lead to a compact representation of all types of signals. The performance of such dictionaries depends on how suitable they are for describing these signals. However, there are many potential application areas, such as transient and complex music sound signals, where fixed basis expansions are not well suited to model the signals. A compact decomposition is best achieved when the elements of the dictionary have strong similarities with the signal under consideration. In this case, a smaller set of more specialized basis functions in the dictionary is needed to describe the significant characteristics of the signal [7, 17, 8]. Ideally, the basis itself should be adapted
to the specific class of signals which are used to compose the original signal. As we are dealing with a specific class of transient signals, we believe that it is more appropriate to consider designing dictionaries based on learning.

Given the training clapping sounds and using an adaptive training process, we seek a dictionary that yields compact representations of the claps matrix S. Aharon et al. [1] proposed such a method, named the K-SVD algorithm, which trains a dictionary from the given training signals. The K-SVD algorithm is a very effective technique and has been used in many image processing applications [3, 9]. It is based on an iterative optimization process that produces a sparse representation of the given samples using the current dictionary and updates the atoms until the best representation is reached. Dictionary columns are updated along with the sparse representation coefficients related to them, which accelerates the convergence.

In the proposed scheme, the K-SVD algorithm is used to train an adaptive dictionary D which determines the best possible representation of the given clapping sounds. The K-SVD algorithm takes the sound grains matrix Λ as the initial dictionary D0, a number of iterations j, and a set of given example signals, i.e. the claps matrix S. Finding a sparse representation of the sound events in S consists in solving

min_{D, ui} ∑_i ‖si − D ui‖2²  such that  ‖ui‖0 ≤ T0 ∀ i  (6)

where T0 limits the number of non-zero entries in each weight vector ui.

The orthogonal matching pursuit (OMP) is used to find the decomposition pattern of the input claps over the trained dictionary. The OMP is a greedy step-wise regression algorithm, and it is applied in the proposed scheme to approximate the solution of the sparsity-constrained sparse coding problem, selecting at each stage the dictionary atom having the maximal projection onto the residual. Information about the subjects and clapping modes is labeled onto each synthesis pattern for future reference and for the selection of the best sound parameters.

During the synthesis process, a clap sound from the claps matrix S can be synthesized by selecting and adding the corresponding synthesis pattern and the dictionary atoms, which can be written as

ŝi ≅ ∑_{j∈J} φj ui(j)  (7)

where the vector J contains the T0 indices of the non-zero entries in ui. The perceptual quality of the synthesized clap sound ŝi is directly related to the number of non-zero entries in ui: the quality improves sharply for the first few atoms, but further improvements become imperceptible beyond a particular value of T0.

Eq. (7) is used to generate a set of three synthesized claps taken from three different subjects at T0 = [5, 10, 15]. It can be observed from Figs. 3, 4, and 5 that as the number of non-zero entries T0 increases in the weight vector ui, the synthesized clap more closely approximates the original sound, with the best approximation achieved at T0 = 15, after which the gain in approximation is insignificant.

Figure 3. Original and synthesized claps from subject-1 at T0 = [5, 10, 15].
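As an illustration of this sparse-coding step, the OMP atom selection and least-squares refit described above, followed by resynthesis as in Eq. (7), can be sketched in a few lines of NumPy. This is a toy sketch with a random unit-norm dictionary and a synthetic sparse signal, not the paper's trained dictionary or recorded claps; the function name `omp` and all sizes are illustrative assumptions.

```python
import numpy as np

def omp(D, s, T0):
    """Greedy OMP: at each stage pick the atom with the maximal projection
    onto the current residual, then least-squares refit on the chosen set."""
    n, K = D.shape
    residual = s.copy()
    support = []
    u = np.zeros(K)
    for _ in range(T0):
        corr = np.abs(D.T @ residual)
        corr[support] = 0.0                      # never re-select an atom
        support.append(int(np.argmax(corr)))
        coeffs, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        u[:] = 0.0
        u[support] = coeffs
        residual = s - D @ u                     # update the residual
    return u

# Toy data: random unit-norm dictionary and an exactly T0-sparse signal.
rng = np.random.default_rng(0)
n, K, T0 = 64, 128, 5
D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
u_true = np.zeros(K)
u_true[rng.choice(K, T0, replace=False)] = rng.standard_normal(T0)
s = D @ u_true
u = omp(D, s, T0)            # decomposition pattern over the dictionary
s_hat = D @ u                # resynthesis as in Eq. (7)
print(np.count_nonzero(u), np.linalg.norm(s - s_hat) / np.linalg.norm(s))
```

Increasing T0 lets the reconstruction ŝ approach s, mirroring the behaviour reported for Figs. 3-5, where the gain becomes insignificant after a certain T0.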
[...] listeners gave the same feedback when these samples were played back.

[...] and D, to synthesize a clap sound. Every time Eq. (7) is evaluated, it generates a clap. For expressive synthesis, when a clap sound ŝi is [...]

Figure: the original weight vector u and the expressive variants u + alpha1 and u + alpha2; the clap approximated using (u) with T0 = 15 compared with its expressive version synthesized using (u + alpha1). Horizontal axes in samples.
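The legend fragments above (u, u + alpha1, u + alpha2) suggest that expressive variants are produced by perturbing the sparse weight vector before resynthesis through Eq. (7). The following is a minimal sketch of that idea only, assuming a toy dictionary and a hypothetical perturbation alpha1; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, T0 = 64, 128, 5
D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)           # toy unit-norm dictionary

u = np.zeros(K)
support = rng.choice(K, T0, replace=False)
u[support] = rng.standard_normal(T0)     # sparse synthesis pattern

alpha1 = np.zeros(K)
alpha1[support] = 0.1 * rng.standard_normal(T0)   # hypothetical perturbation
                                                  # of the active weights only
clap = D @ u                   # clap approximated using (u), Eq. (7)
clap_expr = D @ (u + alpha1)   # expressive variant using (u + alpha1)
print(np.linalg.norm(clap - clap_expr))
```

Keeping the perturbation on the existing support preserves the T0-sparsity of the pattern while changing the rendered clap, which is consistent with the side-by-side waveforms the figure legend describes.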
[...] synthesis method was presented which can generate non-repetitive and customized clapping sounds. The simulated claps showed that as the number of non-zero entries in the weight vector increased, the synthesized clap sound got closer to the original clap. In some cases, even five weights (i.e. T0 = 5) and their corresponding atoms were sufficient to generate a clap with a satisfactory level of perceived sound quality. It was also observed that an approximation with T0 = 15 was sufficient to yield an excellent perceived sound quality.

In the future, we will create control models for a single clapper and for several clappers with different clapping modes. We would also like to further investigate the expressive synthesis model and analyze the distribution of synthesis patterns of real-life sound events and their possible statistical or mathematical modeling. We will then evaluate the perceptual quality of the synthesis model using subjective tests.

6. REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, Nov 2006.

[2] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33-61, Dec 1998.

[3] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736-3745, Dec 2006.

[4] R. Gribonval and E. Bacry, "Harmonic decomposition of audio signals with matching pursuit," IEEE Transactions on Signal Processing, vol. 51, no. 1, pp. 101-111, Jan 2003.

[5] K. Hanahara, Y. Tada, and T. Muroi, "Human-robot communication by means of hand-clapping (preliminary experiment with hand-clapping language)," in Proc. of IEEE Int. Conf. on Systems, Man and Cybernetics (ISIC-2007), Oct 2007, pp. 2995-3000.

[6] A. Jylhä and C. Erkut, "Inferring the hand configuration from hand clapping sounds," in Proc. of Int. Conf. on Digital Audio Effects (DAFx-08), Espoo, Finland, Sep 2008.

[7] P. Leveau, E. Vincent, G. Richard, and L. Daudet, "Instrument-specific harmonic atoms for mid-level music representation," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 116-128, Jan 2008.

[8] M. S. Lewicki, T. J. Sejnowski, and H. Hughes, "Learning overcomplete representations," Neural Computation, vol. 12, no. 2, pp. 337-365, Feb 2000.

[9] J. Mairal, M. Elad, and G. Sapiro, "Sparse representation for color image restoration," IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 53-69, Jan 2008.

[10] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397-3415, Dec 1993.

[11] P. Masri and A. Bateman, "Improved modeling of attack transients in music analysis-resynthesis," in Proc. of Int. Computer Music Conf. (ICMC-96), Hong Kong, China, Aug 1996, pp. 100-103.

[12] G. P. Nason and B. Silverman, "The stationary wavelet transform and some statistical applications," in Lecture Notes in Statistics, 103, 1995, pp. 281-299.

[13] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM Journal on Computing, vol. 24, no. 2, pp. 227-234, Apr 1995.

[14] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Proc. of Asilomar Conf. on Signals, Systems and Computers, vol. 1, Nov 1993, pp. 40-44.

[15] L. Peltola, C. Erkut, P. R. Cook, and V. Välimäki, "Synthesis of hand clapping sounds," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1021-1029, Mar 2007.

[16] J. C. Pesquet, H. Krim, and H. Carfatan, "Time-invariant orthonormal wavelet representations," IEEE Transactions on Signal Processing, vol. 44, no. 8, pp. 1964-1970, Aug 1996.

[17] M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, "Sparse representations in audio and music: From coding to source separation," Proceedings of the IEEE, vol. 98, no. 6, pp. 995-1005, Jun 2010.

[18] B. H. Repp, "The sound of two hands clapping: An exploratory study," Journal of the Acoustical Society of America, vol. 81, no. 4, pp. 1100-1109, Apr 1987.

[19] C. Roads, "Introduction to granular synthesis," Computer Music Journal, vol. 12, no. 2, pp. 11-13, 1988.

[20] B. L. Sturm, C. Roads, A. McLeran, and J. J. Shynk, "Analysis, visualization, and transformation of audio signals using dictionary-based methods," in Proc. of Int. Computer Music Conf. (ICMC-2008), Belfast, Northern Ireland, Aug 2008.

[21] S. Tucker and G. J. Brown, "Classification of transient sonar sounds using perceptually motivated features," IEEE Journal of Oceanic Engineering, vol. 30, no. 3, pp. 588-600, Jul 2005.