John Woodruff and Yipeng Li
Dept. of Computer Science and Engineering
The Ohio State University
{woodrufj, liyip}@cse.ohio-state.edu

DeLiang Wang
Dept. of Computer Science and Engineering
& the Center for Cognitive Science
The Ohio State University
dwang@cse.ohio-state.edu
ISMIR 2008 – Session 4c – Automatic Music Analysis and Transcription
a monaural music separation system which incorporates the proposed algorithm. Section 5 shows quantitative evaluation results of our separation system and Section 6 provides a final discussion.

2 SINUSOIDAL MODELING

[Figure: log-amplitude plot; only axis residue survives]

Modeling a harmonic sound source as the summation of in- […] frequency, and phase of sinusoidal component h_n, respectively, of source n at time frame m. H_n denotes the number […]
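A harmonic source in this model is a sum of sinusoidal components, each with its own amplitude, frequency, and phase per time frame. As a rough, self-contained sketch of such a synthesis (the fundamental frequency, amplitudes, and phases below are illustrative placeholders, not values from the paper):

```python
import numpy as np

fs = 44100            # sampling frequency (Hz), matching the paper's setup
F0 = 440.0            # assumed fundamental frequency, for illustration only
H = 5                 # number of harmonics H_n
t = np.arange(0, 0.1, 1.0 / fs)

# One amplitude and phase per sinusoidal component h_n (placeholders).
amps = np.array([1.0, 0.5, 0.3, 0.2, 0.1])
phases = np.zeros(H)

# Harmonic source as a summation of sinusoids:
# s(t) = sum_h a_h * cos(2*pi * h * F0 * t + phi_h), h = 1..H
s = sum(a * np.cos(2 * np.pi * (h + 1) * F0 * t + p)
        for h, (a, p) in enumerate(zip(amps, phases)))

print(s.shape)
```

In the paper these parameters vary frame by frame and are estimated from the mixture; the fixed values here only illustrate the form of the model.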
[Figure residue: true and predicted phase change plotted against time (sec); a companion panel is indexed by harmonic number]
Therefore the phase change can be predicted from the pitch of a harmonic source. Figure 3 shows the phase change between successive time frames as measured from the first harmonic of a flute recording, and the predicted phase change using the true pitch of the signal. The predicted phase from […]

[…] R_n(m, k) = r_{m_0→m}^{h_n} e^{i Σ_{l=m_0}^{m} Δφ_n^{h_n}(l)} W(k f_b − h_n F_n(m)),  (6)

then Equation (2) becomes

X(m, k) = Σ_n R_n(m, k) S_n^{h_n}(m_0).  (7)
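Between frames spaced Δt seconds apart, harmonic h of a source with pitch F advances in phase by about 2π h F Δt, which is what makes the prediction in Figure 3 possible. A small sketch using the analysis settings reported later in the paper (1024-sample hop at 44.1 kHz); the wrapping to the principal value is our implementation assumption:

```python
import numpy as np

fs = 44100.0        # sampling frequency (Hz)
hop = 1024          # frame shift in samples (as in Section 5.2)

def predicted_phase_change(pitch_hz, harmonic):
    """Predicted phase advance of one harmonic over a frame hop,
    wrapped to the principal value in (-pi, pi]."""
    dphi = 2.0 * np.pi * harmonic * pitch_hz * hop / fs
    return np.angle(np.exp(1j * dphi))  # wrap via the unit circle

# First harmonic of a 440 Hz tone (cf. the flute example in Figure 3):
print(predicted_phase_change(440.0, 1))
```

Comparing this prediction with the phase change measured from the mixture spectrogram is what the text above illustrates with the flute recording.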
[…] frequency bins from k_0 to k_1, we can write Equation (7) for each T-F unit in the region and the set of equations can be represented as

X = RS,  (8)

where

    ⎡ X(m_0, k_0) ⎤
    ⎢      ⋮      ⎥
X = ⎢ X(m_0, k_1) ⎥ ,  (9)
    ⎢      ⋮      ⎥
    ⎣ X(m_1, k_1) ⎦

    ⎡ R_1(m_0, k_0)  …  R_N(m_0, k_0) ⎤
    ⎢       ⋮                ⋮        ⎥
R = ⎢ R_1(m_0, k_1)  …  R_N(m_0, k_1) ⎥ ,  and  (10)
    ⎢       ⋮                ⋮        ⎥
    ⎣ R_1(m_1, k_1)  …  R_N(m_1, k_1) ⎦

S = [ S_1^{h_1}(m_0), …, S_N^{h_N}(m_0) ]^T.  (11)

Figure 5. LS estimation of overlapped harmonics. (a) The magnitude spectrum of a harmonic of the first source in the overlapping T-F region. (b) The magnitude spectrum of a harmonic of the second source in the same T-F region. (c) The magnitude spectrum of the mixture at the same T-F region. (d) The estimated magnitude spectrum of the harmonic from the first source. (e) The estimated magnitude spectrum of the harmonic from the second source.

The coefficient matrix R is constructed according to Equation (6) for each T-F unit. X is a vector of the observed
spectral values of the mixture in the overlapping region. We seek a solution for S to minimize the sum of squared error

J = (X − RS)^H (X − RS).  (12)

The least-squares solution is given by

S = (R^H R)^{−1} R^H X,  (13)

where ^H denotes conjugate transpose. After S_n^{h_n}(m_0) is estimated for each of the sources active in the overlapping region, we use Equation (5) to calculate S_n^{h_n}(m) for all m ∈ [m_0, m_1].

Figure 5 shows the effectiveness of the proposed algorithm in recovering two overlapping harmonics for two instruments. In this case, the third harmonic of the first source overlaps with the fourth harmonic of the second source. Figure 5(c) shows the magnitude spectrum of the mixture in the overlapping region. Note that the amplitude modulation results from the relative phase of the two harmonics. The estimated magnitude spectra of the two harmonics are shown in Figure 5(d) and (e). For comparison, the magnitude spectra of the two sources obtained from pre-mixed signals are shown in Figure 5(a) and (b). It is clear that the estimated magnitude spectra are very close to the true magnitude spectra.

4 A MONAURAL MUSIC SEPARATION SYSTEM

We incorporate the proposed algorithm into a monaural music separation system to evaluate its effectiveness. The diagram of the system is shown in Figure 6. The input to the system is a polyphonic mixture and pitch contours of individual sources. As mentioned previously, we use ground truth pitch estimated from the clean signals for each source. In the harmonic labeling stage, the pitches are used to identify overlapping and non-overlapping harmonics.

To formalize the notion of overlapping harmonics, we say that harmonics h_{n1} and h_{n2} for sources n1 and n2, respectively, overlap when their frequencies are sufficiently close, |f_{n1}^{h_{n1}}(m) − f_{n2}^{h_{n2}}(m)| < θ_f. If one assumes the signals strictly adhere to the sinusoidal model, the bandwidth of W determines how many frequency bins will contain energy from a sinusoidal component and one can set an amplitude threshold to determine θ_f.

For non-overlapped harmonics, sinusoidal parameters are estimated by minimizing the sum of squared error between the mixture and the predicted source energy,

J = Σ_{k ∈ K_n^{h_n}(m)} |X(m, k) − W(k f_b − h_n F_n(m)) S_n^{h_n}(m)|²,  (14)

where K_n^{h_n}(m) is the set of frequency bins associated with harmonic h_n in frame m. The solution is given by:

S_n^{h_n}(m) = Σ_{k ∈ K_n^{h_n}(m)} X(m, k) W(h_n F_n(m) − k f_b) / Σ_{k ∈ K_n^{h_n}(m)} |W(h_n F_n(m) − k f_b)|².  (15)

As described in Section 3.3, we utilize the amplitude envelope of non-overlapped harmonics to resolve overlapping harmonics.
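Equations (8)–(13) describe an ordinary complex least-squares problem: one linear equation per T-F unit in the overlapping region, one unknown coefficient per source. A minimal numerical sketch of that algebra, with random complex placeholders standing in for the R_n(m, k) entries that the paper derives from Equation (6):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the coefficient matrix of Equation (10):
# one row per T-F unit in the overlapping region, one column per source.
# In the paper each entry comes from Equation (6); random values here
# serve purely to illustrate the algebra.
num_tf_units, num_sources = 40, 2
R = (rng.normal(size=(num_tf_units, num_sources))
     + 1j * rng.normal(size=(num_tf_units, num_sources)))

# Ground-truth sinusoidal coefficients S_n^{h_n}(m0) (Equation (11)).
S_true = np.array([1.5 * np.exp(1j * 0.3), 0.8 * np.exp(-1j * 1.1)])

# Observed mixture spectrum in the overlapping region (Equation (8)).
X = R @ S_true

# Least-squares solution of Equation (13): S = (R^H R)^{-1} R^H X.
S_hat = np.linalg.solve(R.conj().T @ R, R.conj().T @ X)

print(np.allclose(S_hat, S_true))
```

With more T-F units than sources the system is overdetermined, and the normal equations yield the unique minimizer of the squared error in Equation (12); in this noiseless sketch the coefficients are recovered exactly up to rounding.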
[Figure 6. System diagram]

Since the envelope information is sequential, we resolve overlapped h_n for time frames [m_0, m_1] using a non-overlapped harmonic h*_n. To determine appropriate time frames for this processing, we first identify sequences of time frames for which a harmonic h_n is overlapped with one or more other harmonics. If the pitch of any of the sources contributing to the overlapping region changes, we break the sequence of frames into subsequences. Given a sequence of frames, we choose the strongest harmonic for each source that is unobstructed in the entire sequence as h*_n. We use θ_f to determine the bin indices, [k_0, k_1], of the overlapping region.

For each overlapping region, we perform least-squares estimation to recover the sinusoidal parameters for each instrument's harmonics. To utilize the mixture signal as much as possible, we estimate the source spectra differently for the overlapped and non-overlapped harmonics. For all non-overlapped harmonics, we directly distribute the mixture energy to the source estimate,

Ŷ_n^{no}(m, k) = X(m, k)  ∀ k ∈ K_n^{h_n}(m).  (16)

For the overlapped harmonics, we utilize the sinusoidal model and calculate the spectrogram using

Ŷ_n^{o}(m, k) = S_n^{h_n}(m) W(k f_b − f_n^{h_n}(m)).  (17)

Finally, the overall source spectrogram is Ŷ_n = Ŷ_n^{no} + Ŷ_n^{o} and we use the overlap-add technique to obtain the time-domain estimate, ŷ_n(t), for each source.

5 EVALUATION

5.1 Database

To evaluate the proposed system, we constructed a database of 20 quartet pieces by J. S. Bach. Since it is difficult to obtain multi-track recordings, we synthesize audio signals from MIDI files using samples of individual notes from the RWC music instrument database [15]. For each line selected from the MIDI file, we randomly assign one of four instruments: clarinet, flute, violin or trumpet. For each note in the line, a sample with the closest average pitch is selected from the database for the chosen instrument. We create two source mixtures (using the alto and tenor lines from the MIDI file) and three source mixtures (soprano, alto and tenor), and select the first 5 seconds of each piece for evaluation. All lines are mixed to have equal level, thus lines in the two instrument mixtures have 0 dB SNR and those in the three instrument mixtures have roughly −3 dB SNR. Details about the synthesis procedure can be found in [16]. Admittedly, audio signals generated in this way are a rough approximation of real recordings, but they show realistic spectral and temporal variations.

5.2 Results

For evaluation we use the signal-to-noise ratio (SNR),

SNR = 10 log_10 ( Σ_t y(t)² / Σ_t (ŷ(t) − y(t))² ),  (18)

where y(t) and ŷ(t) are the clean and the estimated instrument signals, respectively. We calculate the SNR gain after separation to show the effectiveness of the proposed algorithm. In our implementation, we use a frame length of 4096 samples with sampling frequency 44.1 kHz. No zero-padding is used in the DFT. The frame shift is 1024 samples. We choose θ_f = 1.5 f_b, one and a half times the frequency resolution of the DFT. The number of harmonics for each source, H_n, is chosen such that f_n^{H_n}(m) < f_s/2 for all time frames m, where f_s denotes the sampling frequency.

Performance results are shown in Table 1. The first row of the table is the SNR gain for the two source mixtures achieved by the Virtanen system [1], which is also based on sinusoidal modeling. At each frame, this approach uses pitch information and the least-squares objective to simultaneously estimate the amplitudes and phases of the harmonics of all instruments. A so-called adaptive frequency-band model is used to estimate the parameters of overlapped harmonics. To avoid inaccurate implementation of this system, we asked the author to provide separated signals for our set of test mixtures. The second row in Table 1 shows the SNR gain achieved by our system. On average, our approach achieved a 14.5 dB SNR improvement, 3.4 dB higher than the Virtanen system. The third row shows the SNR gain of our system on the three source mixtures. Note that all results were obtained using ground truth pitches. Sound demos of our separation system can be found at: www.cse.ohio-state.edu/~woodrufj/mmss.html

6 DISCUSSION AND CONCLUSION

In this paper we have proposed an algorithm for resolving overlapping harmonics based on CAM and phase change estimation from pitches. We incorporate the algorithm in a separation system and quantitative results show significant improvement in terms of SNR gain relative to an existing
monaural music separation system. In addition to large increases in SNR, the perceptual quality of the separated signals is quite good in most cases. Because reconstruction of overlapped harmonics is accurate and we utilize the mixture for non-overlapped harmonics, the proposed system does not alter instrument timbre in the way that synthesis with a bank of sinusoids can. A weakness of the proposed approach is the introduction of so-called musical noise as performance degrades. One aspect of future work will be to address this issue and create higher quality output signals.

In this study we assume that the pitches of sources are known. However, for practical applications, the true pitches of sources in a mixture are not available and must be estimated. Since our model uses pitch to identify overlapped and non-overlapped harmonics and pitch inaccuracy affects both the least-squares estimation and phase change prediction, good performance is reliant on accurate pitch estimation. We are currently investigating methods that relax the need for accurate prior knowledge of pitch information. Preliminary results suggest that performance similar to the Virtanen system using ground truth pitch can still be achieved by our approach even with prior knowledge of only the number of sources (when combining our system with multi-pitch detection) or rough pitch information (as provided by MIDI data).

Acknowledgment

The authors would like to thank T. Virtanen for his assistance in sound separation and comparison. This research was supported in part by an AFOSR grant (F49620-04-1-0027) and an NSF grant (IIS-0534707).

7 REFERENCES

[1] T. Virtanen, "Sound source separation in monaural music signals," Ph.D. dissertation, Tampere University of Technology, 2006.

[2] E. M. Burns, "Intervals, scales, and tuning," in The Psychology of Music, D. Deutsch, Ed. San Diego: Academic Press, 1999.

[3] A. Klapuri, "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 804–816, 2003.

[4] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using multipitch analysis and iterative parameter estimation," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001, pp. 83–86.

[6] M. Bay and J. W. Beauchamp, "Harmonic source separation using prestored spectra," in Independent Component Analysis and Blind Signal Separation, 2006, pp. 561–568.

[7] A. S. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press, 1990.

[8] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ: Wiley/IEEE Press, 2006.

[9] H. Viste and G. Evangelista, "Separation of harmonic instruments with overlapping partials in multi-channel mixtures," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 25–28.

[10] J. Woodruff and B. Pardo, "Using pitch, amplitude modulation and spatial cues for separation of harmonic instruments from stereo music recordings," EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007.

[11] S. A. Abdallah and M. D. Plumbley, "Unsupervised analysis of polyphonic music by sparse coding," IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 179–196, 2006.

[12] T. Virtanen, "Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.

[13] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[14] X. Serra, "Musical sound modeling with sinusoids plus noise," in Musical Signal Processing, C. Roads, S. Pope, A. Picialli, and G. Poli, Eds. Lisse, The Netherlands: Swets & Zeitlinger, 1997.

[15] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Music genre database and musical instrument sound database," in International Conference on Music Information Retrieval, 2003.

[16] Y. Li and D. L. Wang, "Pitch detection in polyphonic music using instrument tone models," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007, pp. II.481–484.