


Pei Xiang and Shlomo Dubnov

Center for Research in Computing and the Arts (CRCA)
California Institute for Telecommunications and Information Technology (Cal-(IT)2)
University of California, San Diego

ABSTRACT

This paper presents a new approach and prototype system for karaoke vocal or instrument removal. The algorithm estimates the singer's location and spatial transfer function from a stereo recording, then mutes components in the recording according to the estimated orientation. At the current stage, the estimation is obtained by analyzing solo segments in the recording. The algorithm turns out to be robust against reverb and spatial acoustics. To compensate for the low-frequency loss that usually follows vocal removal, the original recording is low-pass filtered around 120 Hz and mixed back with the processed mono sound to generate a stereo karaoke track.

1. INTRODUCTION

In karaoke, a popular entertainment, the quality of the accompanying sound track directly affects the aesthetic experience of the "user singer". Reproducing an accompaniment requires a lot of time and labor, and the result is often not close enough to the original mix. Thus, removing the vocal or lead instrument track from the original recording remains an attractive topic.

Currently, the main method to achieve this goal, in both hardware and software, is to generate a mono sound track by subtracting the left and right channels of the original recording, hoping that the vocal component will disappear. This method was patented in 1996 [1]. Many karaoke machines and software products such as the AnalogX Vocal Remover, WinOKE, and the YoGen Vocal Remover work this way, and the same job can be done in any sound editor. The results are usually not satisfying, because the method only works if the vocal component is identical at all times in both channels of the stereo recording, i.e. a mono source panned right at the center. This assumption does not hold in all situations, especially live concert recordings where the full stereo image is picked up by a single pair of microphones. The singer is not necessarily standing in the center, and room acoustics can make the vocal non-identical in the left and right channels, i.e. give it different amplitudes and delays in different frequency bands. Moreover, artificial reverb may have been added to the vocal component in the recording. All these conditions cause the subtraction method to fail. Another problem is that low-frequency instruments are usually mixed into the center position of a stereo mix, so this process can completely eliminate the bass, which is also undesirable. There are also other methods, such as applying EQs to the original recording to attenuate the vocal frequency bands of the spectrum. This cannot remove the vocal completely, and it greatly alters the spectral composition of the original material.

In our approach, we assume that the vocal component is stabilized in a spatial location and recorded with a pair of microphones. Once we can infer the transfer function associated with its location from the stereo signal, in theory we can manipulate components in that location according to different needs. The following sections describe our model in detail and explain some experimental results with existing stereo recordings. In this paper, "vocal" can mean either the lead singer or the lead instrument that is to be removed from a stereo recording.

2. METHODOLOGY

2.1. System architecture and initial assumptions

The organization of the system is shown in Figure 1. First, an algorithm tries to identify segments in the recording where the vocal is playing solo, or at least contributes most of the energy. These segments are analyzed to estimate the transfer function from the vocal's location to each of the microphones. After that, sound in the estimated location can be canceled or suppressed by projecting the stereo recording on appropriate directions in different frequency bands, resulting in a mono, vocal-suppressed sound track. Additional spectral corrections can then be applied to further remove residual high-frequency components of the vocal, such as various consonants. Finally, the bass part of the original recording, which does not overlap with the vocal in frequency, is added back to the mono file.

We assume the recording situation is a live stereo production session where multiple instruments and the vocalist are stabilized in separate locations. Reverb and stationary room noise are assumed as well. This scene is illustrated in Figure 2, where solid lines represent direct sound and dashed lines represent some of the early reflections. With reasonable reverb time, our model is robust to additive noise, room reverberation and other factors that will be considered in the discussion section.
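To make the central idea concrete before the derivation, here is a toy NumPy sketch (our own illustration, not the authors' Matlab implementation): each source reaches the two microphones with its own gain and delay, and once the vocal's per-frequency gain and delay are known, projecting the stereo spectra onto the direction orthogonal to the vocal's mixing vector cancels it while the accompaniment survives. The `mic` propagation model and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
vocal = rng.standard_normal(n)
accomp = rng.standard_normal(n)

def mic(sig, gain, delay):
    """Toy propagation: attenuate by `gain`, delay by `delay` samples
    (circular, so the DFT model X(w) = gain*exp(-j*w*delay)*S(w) is exact)."""
    return gain * np.roll(sig, delay)

# Stereo scene: vocal and accompaniment at two different locations.
left = mic(vocal, 1.0, 0) + mic(accomp, 0.8, 5)
right = mic(vocal, 0.6, 4) + mic(accomp, 1.0, 0)

# Vocal's relative transfer function (assumed known here): gain 0.6, delay 4.
w = 2 * np.pi * np.arange(n // 2 + 1) / n
Ak = 0.6 * np.exp(-1j * w * 4)

# Project the stereo spectra onto the direction orthogonal to [1, Ak]:
# mono(w) = -Ak*Left(w) + Right(w) removes anything at the vocal's location.
mono = np.fft.irfft(-Ak * np.fft.rfft(left) + np.fft.rfft(right), n)

# Sanity check: the same projection applied to the vocal alone yields
# near-silence, while the accompaniment clearly survives in `mono`.
vocal_only = np.fft.irfft(-Ak * np.fft.rfft(mic(vocal, 1.0, 0))
                          + np.fft.rfft(mic(vocal, 0.6, 4)), n)
print(np.max(np.abs(vocal_only)) < 1e-9, np.std(mono) > 0.1)  # True True
```

The rest of Section 2 is about obtaining `Ak` when it is not known in advance.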


Figure 1. System diagram of our vocal removal design.

Figure 2. Schematic of a live stereo recording session.

2.2. Source separation model with transfer function

A typical model for blind source separation of an N-channel sensor signal x(t) arising from M unknown scalar source signals s(t), in the case of an instantaneous mixture, is described by

    x(t) = A s(t) + v(t)    (1)

where

    x(t) = [x_1(t), ..., x_N(t)]^T,   s(t) = [s_1(t), ..., s_M(t)]^T    (2)

A is an N×M linear mixing matrix, and v(t) is zero-mean, white additive noise. In a convolutive environment such as the recording session situation, signals at different locations have different transfer functions:

    x_n(t) = Σ_m ∫ a_nm(τ) s_m(t − τ) dτ + v_n(t),   n = 1, ..., N    (3)

where a_nm(τ) is the impulse response of the transfer function from the mth source signal to the nth microphone. Taking the Short-Time Fourier Transform (STFT) of (3) turns this into an instantaneous mixture separately for every frequency, giving

    X_n(t, ω) = Σ_m A_nm(ω) S_m(t, ω) + V_n(t, ω),   n = 1, ..., N    (4)

where S_m(t, ω) and V_n(t, ω) are the STFTs of s_m(t) and v_n(t), respectively, and t denotes the STFT window position. The temporal transfer function of the mth source signal to microphone n is defined as

    A_nm(ω) = ∫ a_nm(τ) e^{−jωτ} dτ = â_nm(ω) e^{−jω δ̂_nm(ω)}    (5)

where we define

    â_nm(ω) = ‖A_nm(ω)‖,   δ̂_nm(ω) = −∠A_nm(ω)/ω    (6)

In matrix notation, the model (1) can be written as

    X(t, ω) = A(ω) S(t, ω) + V(t, ω)    (7)

In our case N = 2, so that (7) in detail reads

    [X_1(t,ω)]   [â_11(ω)e^{−jωδ̂_11(ω)} ··· â_1M(ω)e^{−jωδ̂_1M(ω)}] [S_1(t,ω)]   [V_1(t,ω)]
    [X_2(t,ω)] = [â_21(ω)e^{−jωδ̂_21(ω)} ··· â_2M(ω)e^{−jωδ̂_2M(ω)}] [   ⋮    ] + [V_2(t,ω)]    (8)
                                                                     [S_M(t,ω)]

Without loss of generality, we can absorb the attenuation and delay parameters for each frequency bin in the first microphone signal x_1(t) into the definition of the source. In this way, (8) can be rewritten as

    [X_1(t,ω)]   [        1          ···          1         ] [S_1(t,ω)]   [V_1(t,ω)]
    [X_2(t,ω)] = [a_1(ω)e^{−jωδ_1(ω)} ··· a_M(ω)e^{−jωδ_M(ω)}] [   ⋮    ] + [V_2(t,ω)]    (9)
                                                               [S_M(t,ω)]

In (9), suppose the vocal component is the kth source, associated with the STFT S_k(t, ω). If we can estimate the kth column [1, a_k(ω)e^{−jωδ_k(ω)}]^T of A, then left-multiplying by a vector that is orthogonal to it will completely remove the kth source, achieving the initial goal.
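As a quick numerical check of the model in (5)-(6): for a filter that only attenuates by â and delays by δ̂ samples, the DFT of its impulse response is â·e^{−jωδ̂}, so the magnitude gives the attenuation and the phase slope gives the delay. A small NumPy sketch (our own illustration; the authors' system itself is in Matlab):

```python
import numpy as np

# Impulse response that attenuates by a = 0.5 and delays by delta = 3 samples.
N = 64
a, delta = 0.5, 3
h = np.zeros(N)
h[delta] = a

# DFT-domain transfer function A(w) = a * exp(-j*w*delta), w = 2*pi*k/N
# for bin k (cf. equation (5)).
A = np.fft.fft(h)
w = 2 * np.pi * np.arange(N) / N

a_hat = np.abs(A)                     # recovers the attenuation, equation (6)
# Phase slope recovers the delay; avoid dividing by w = 0 at the DC bin.
delta_hat = -np.angle(A[1]) / w[1]

print(round(a_hat[1], 6), round(delta_hat, 6))  # prints 0.5 3.0
```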
2.3. Estimating the vocal component location

With solo segments of the vocal extracted and their STFT time-frequency cells available, it is possible to estimate the unstructured spatial transfer function for each frequency [2]. A variety of methods to estimate the transfer function is being explored. Considering equation (9) in the case of a single source allows estimation of the transfer function parameters by division of the STFTs of the left and right channels, giving

    a_k(ω) = ‖X_2(ω) / X_1(ω)‖    (10)

and

    δ_k(ω) = −(1/ω) Im( log( X_2(ω) / X_1(ω) ) )    (11)

This simple method of STFT division [3] is not robust to noise and gives unsatisfactory results. Here we describe in detail an autocorrelation method that is robust to uncorrelated additive noise. The autocorrelation matrix of (7) at a given time-frequency cell is obtained by averaging the following equation:

    X(t,ω)X(t,ω)^H = A(ω)S(t,ω)S(t,ω)^H A(ω)^H
                   + A(ω)S(t,ω)V(t,ω)^H
                   + V(t,ω)S(t,ω)^H A(ω)^H
                   + V(t,ω)V(t,ω)^H    (12)

We have assumed that signals and noise are uncorrelated. Denoting by R_x, R_s and R_v the correlation matrices of the microphones, the source signals and the noise respectively, and considering the signals as stationary, averaging over a time window gives

    R_x(ω) = A(ω) R_s(ω) A(ω)^H + R_v(ω)    (13)

Here we assume that the noise is white, so that the matrix R_v(ω) is diagonal and R_v(ω) = σ_v² I_N, ∀ω, where I_N is an identity matrix of size N. In segments where only one component s_k exists, (13) becomes

    R_x(ω) = σ_{s_k}² Ã_k(ω) Ã_k(ω)^H + σ_v² I_N    (14)

where

    Ã_k(ω) = [1, a_k(ω)e^{−jωδ_k(ω)}]^T    (15)

Right-multiplying (14) by Ã_k(ω), we get

    R_x(ω) Ã_k(ω) = (σ_{s_k}² Ã_k(ω)^H Ã_k(ω) + σ_v²) Ã_k(ω)    (16)

This means that Ã_k(ω) can be estimated from the eigenvectors of R_x(ω): since the signal part of R_x(ω) is rank one, Ã_k(ω) is the eigenvector corresponding to the largest eigenvalue

    λ = σ_{s_k}² Ã_k(ω)^H Ã_k(ω) + σ_v²    (17)

One can note that, in principle, the transfer function can then be estimated by dividing the two components of this eigenvector. As will be discussed later, an advantage of our method is that it is robust to added noise. The correlation method can easily be extended to higher-order statistics [4] by generalizing equation (12). For example, for 4th-order cumulants we construct the matrix X(ω)X³(ω)^H − 3X(ω)X(ω)^H. It can be shown that for a Gaussian signal this matrix equals zero, since the 4th cumulant of a Gaussian signal equals three times the 2nd cumulant (correlation). This effectively eliminates the additional noise matrix from equation (13). The eigenvectors of the resulting matrix are the same as in the correlation case.

2.4. Vocal removal

After obtaining Ã_k(ω), we find a vector orthogonal to it,

    Ã_k^⊥ = [ −a_k(ω)e^{−jωδ_k(ω)}   1 ]    (18)

and then left-multiply the microphone signal in (9) with Ã_k^⊥ to remove the kth component:

    Ã_k^⊥ [X_1(t,ω)] = [ Ã_k^⊥Ã_1 ··· 0 (kth) ··· Ã_k^⊥Ã_M ] [S_1(t,ω)] + Ã_k^⊥ [V_1(t,ω)]    (19)
          [X_2(t,ω)]                                          [   ⋮    ]         [V_2(t,ω)]
                                                              [S_M(t,ω)]

A special case is when only two components exist in the stereo recording, in which case we can extract the two sources individually. Both sides of (19) are one-dimensional, so the resulting sound is mono.

2.5. Solo segment extraction

As the starting point of the estimation, solo segments in a song carry all the information about the unknown transfer functions. In order to use the algorithm described in Section 2.3, it is important to have a good extraction algorithm for the solo segments. Several approaches are available: we can use the Gaussian mixture model in [2] to find time-frequency cells with only one source, take the W-disjoint orthogonality assumption as indicated in [3] and look for clusters near the center of the stereo field, or utilize voice activity detection (VAD) algorithms [5]. Implementation of solo segment extraction is currently manual.
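The estimation procedure of (13)-(17) can be sketched numerically as follows (our own NumPy illustration with synthetic data, not the authors' code): simulate a single-source "solo segment" at one frequency bin, average the outer products into R_x, take the dominant eigenvector, and read off a_k and δ_k from the ratio of its two components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated solo segment at one frequency bin: mixing vector
# A_k = [1, a*exp(-j*w*delta)]^T plus white noise, as in equation (14).
frames, a, delta, w = 2000, 0.6, 4.0, 0.3  # w in radians per sample
S = rng.standard_normal(frames) + 1j * rng.standard_normal(frames)
Ak = np.array([1.0, a * np.exp(-1j * w * delta)])
X = np.outer(Ak, S) + 0.05 * (rng.standard_normal((2, frames))
                              + 1j * rng.standard_normal((2, frames)))

# Averaged correlation matrix R_x (equation (13)) and its dominant eigenvector.
Rx = X @ X.conj().T / frames
vals, vecs = np.linalg.eigh(Rx)       # Hermitian eigendecomposition
v = vecs[:, np.argmax(vals)]          # eigenvector of the largest eigenvalue

# Divide the two eigenvector components to read off the transfer function
# (equations (10)-(11), applied to the eigenvector instead of raw STFTs).
ratio = v[1] / v[0]
a_hat = np.abs(ratio)                 # should be close to 0.6
delta_hat = -np.angle(ratio) / w      # should be close to 4.0

print(round(a_hat, 2), round(delta_hat, 1))
```

The white-noise term only shifts the eigenvalues, not the dominant eigenvector, which is the robustness argument made above.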
2.6. Delay refinement and bass emphasis

Estimation accuracy starts to decrease in high-frequency bins, as the wavelength of the signal becomes so short that the transfer functions become sensitive to movements of the singer or instrument performer. This results in some residual transients (consonant attacks) of the vocal in the processed sound. A delay refinement method is used on a frame-by-frame basis in order to search for an optimal delay correction for drifts caused by minor movements of the sound source. Searching over a range of possible delay parameters around the theoretical value, we look for the minimum energy of the resulting signal, which occurs when the source is most precisely removed.

Removing the vocal component in the manner described in this paper usually removes a certain amount of bass from the recording at the same time, since low-frequency instruments are often not very localized sources and their locations can coincide with the vocal's. On the other hand, the vocal usually does not occupy the same frequency band as the bass. The fundamental frequency of voiced speech varies with many factors, such as emotion and age; the literature [6] gives a typical range of 85-155 Hz for adult males and 165-255 Hz for adult females. Taking this as a reference, we choose a cut-off frequency of 120 Hz to low-pass filter the original stereo material, and mix the result with the mono vocal-free track. The filtered material contains little of the vocal, retains most of the bass information, and preserves the stereo field in the low frequencies. Simulated results show that this method is very effective at compensating the low frequencies, and it restores some stereo feeling to the accompaniment track.
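The minimum-energy delay search described above can be sketched as follows (our own NumPy toy with a hypothetical one-sample drift, not the authors' Matlab implementation): the projection is evaluated for each candidate delay, and the candidate leaving the least residual energy is kept.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4096
vocal = rng.standard_normal(n)

def stereo_vocal(delay):
    """Vocal as seen by the two mics (circular delay, gain 0.6 at mic 2)."""
    return np.roll(vocal, 0), 0.6 * np.roll(vocal, delay)

# The singer moved: the true delay is 5 samples, while the earlier
# estimate (the "theoretical value") said 4.
left, right = stereo_vocal(5)
w = 2 * np.pi * np.arange(n // 2 + 1) / n
L, R = np.fft.rfft(left), np.fft.rfft(right)

def residual_energy(delta):
    """Energy left after projecting out the direction [1, 0.6*exp(-j*w*delta)]."""
    mono = np.fft.irfft(-0.6 * np.exp(-1j * w * delta) * L + R, n)
    return float(np.sum(mono ** 2))

# Search candidate corrections around the theoretical value (here 4 +/- 3)
# and keep the delay with minimum residual energy (Section 2.6).
candidates = np.arange(1, 8)
best = min(candidates, key=residual_energy)
print(best)  # prints 5, the singer's actual delay
```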

3. EXPERIMENTAL RESULTS

This system design is implemented in Matlab. Several CD recordings of different genres have been tested and compared to the existing method (left-right subtraction). We chose a Hamming window for the STFT and window lengths of 1024 and 2048. For songs with heavy post-production, where the vocal is recorded mono and panned to the center with little reverb, our system works similarly to the existing method of vocal removal, but the bass emphasis noticeably enhances the final sound quality. In recording situations closer to our initial assumptions in 2.1, for example a concert recording of Billie Holiday with a band, the existing method fails to remove the vocal at all, while our system remains as robust as in the previous situation. Some sound examples are available online.

4. DISCUSSION AND FUTURE RESEARCH

The prototype system turns out to function well, especially when the real situation is close to the assumptions. As can be seen from equation (16), added noise does not compromise the accuracy of estimating Ã_k(ω) from the eigenvector of R_x(ω), so the algorithm is robust in a noisy environment. The model considers different attenuations and delays for different frequencies, and an unstructured transfer function is assumed, so it is also robust to reverberant vocal sound, even artificial reverb, as long as the reverb is linear. In a typical live recording session there are usually close mics for individual instruments, whose signals are later added to the stereo recording of the whole ensemble to boost certain instruments. This is still a linear operation that fits our assumptions, so the model should work well in theory. We have also considered delay refinement in order to compensate for small movements of the source. However, many more detailed properties of the acoustics are yet to be investigated, such as the geometrical conditions and the spatial sampling distance between the microphones. STFT window types and window sizes also matter. The window should at least be suitable for perfect reconstruction with a proper hop length (a Hamming window, for example). The window length should be longer than the longest possible reverb time, so that the transfer function model holds over most of a window instead of letting too much of the convolution tails fall into neighboring windows. On the other hand, a longer window results in fewer STFT time slices for the autocorrelation time average. The choice of window length and shape should therefore balance all these trade-offs. Furthermore, solo segment extraction algorithms are to be compared and tested to automate the extraction process, and delay refinement and transient removal algorithms are under development.

5. REFERENCES

[1] Nomura, T. (Mitsubishi Denki Kabushiki Kaisha), "Voice canceler with simulated stereo output", U.S. Patent 5,550,920, 1996.

[2] Dubnov, S., Tabrikian, J., Arnon-Targan, M., "A Method for Directionally-Disjoint Source Separation in Convolutive Environment", ICASSP, Vol. 5, pp. 489-492, Montreal, Canada, 2004.

[3] Jourjine, A., Rickard, S., and Yilmaz, O., "Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2 Mixtures", ICASSP, Vol. 5, pp. 2985-2988, Istanbul, Turkey, June 2000.

[4] Mendel, J.M., "Tutorial on higher-order statistics (spectra) in signal processing and system theory: theoretical results and some applications", Proceedings of the IEEE, Vol. 79, No. 3, pp. 278-305, March 1991.

[5] Fisher, E., Tabrikian, J., Dubnov, S., "Generalized likelihood ratio test for voiced/unvoiced decision using the harmonic plus noise model", ICASSP, Vol. 1, pp. 440-443, 2003.

[6] Baken, R. J., Clinical Measurement of Speech and Voice, London: Taylor and Francis Ltd., 1987.