
Center for Research in Computing and the Arts (CRCA)

California Institute for Telecommunications and Information Technology, Cal-(IT)²

University of California, San Diego

ABSTRACT

This paper presents a new approach and prototype system for karaoke vocal or instrument removal. The algorithm estimates the singer's location and spatial transfer function from a stereo recording, then mutes components in the recording according to the estimated orientation. At the current stage, the estimate is obtained by analyzing solo segments in the recording. The algorithm turns out to be robust against reverb and spatial acoustics. To compensate for the low-frequency loss that usually appears after vocal removal, the original recording is low-pass filtered around 120 Hz and mixed back with the processed mono sound to generate a stereo karaoke track.

1. INTRODUCTION

In karaoke, a popular entertainment, the quality of the accompanying sound track directly affects the aesthetic experience of the "user singer". Reproducing an accompaniment requires a great deal of time and labor, yet the result is often not close enough to the original mix. Thus, removing the vocal or lead instrument track from the original recording remains an attractive topic.

Currently, the main method to achieve this goal, both in hardware and software, is to generate a mono sound track by subtracting the left and right channels of the original recording, in the hope that the vocal component will disappear. This method was patented in 1996 [1]. Many karaoke machines and software packages such as the AnalogX Vocal Remover, WinOKE, and the YoGen Vocal Remover work in this way; the job can also be done in any sound editor. The results are usually not satisfying, because subtraction only works if the vocal component is identical at all times in both channels of the stereo recording, i.e. a mono source panned exactly at the center. This assumption does not apply to all situations, especially live concert recordings where the full stereo image is picked up by a single pair of microphones. The singer is not necessarily standing in the center, and room acoustics can make the vocal non-identical in the left and right channels, meaning different amplitudes and delays in different frequency bands. Moreover, artificial reverb could have been added to the vocal component in the recording. All these conditions cause the subtraction method to fail. Another problem is that low-frequency instruments are usually mixed into the center position of a stereo mix, so this process can also completely eliminate the bass, which is undesirable. There are also other methods, such as applying EQs to the original recording to attenuate the vocal frequency bands of the spectrum. This cannot remove the vocal completely, and it alters the spectral composition of the original material considerably.

In our approach, we assume that the vocal component is fixed at a spatial location and recorded with a pair of microphones. Once we can infer the transfer function associated with that location from the stereo signal, in theory we can manipulate components at that location according to different needs. The following sections describe our model in detail and present some experimental results with existing stereo recordings. In this paper, "vocal" can mean either the lead singer or the lead instrument that is to be removed from a stereo recording.

2. METHODOLOGY

2.1. System architecture and initial assumptions

The organization of the system is shown in Figure 1. First, an algorithm tries to identify segments in the recording where the vocal is playing solo, or at least contributes most of the energy. These segments are analyzed to estimate the transfer function from the vocal's location to each of the microphones. After that, sound at the estimated location can be canceled or suppressed by projecting the stereo recording onto appropriate directions in different frequency bands, resulting in a mono, vocal-suppressed sound track. Additional spectral corrections can then be applied to further remove residual high-frequency components of the vocal, such as various consonants. Finally, the bass part of the original recording, which does not overlap with the vocal in frequency, is added back to the mono file.

We assume the recording situation is a live stereo production session where multiple instruments and the vocalist are fixed in separate locations. Reverb and stationary room noise are assumed as well. The scene is illustrated more intuitively in Figure 2, where solid lines represent direct sound and dashed lines represent some of the early reflections. With reasonable reverb time, our model is robust to additive noise, room reverberation and other factors that will be considered in the discussion section.
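Under the assumptions above, the two channels carry the vocal with different gains and delays, which is exactly why the naive left-minus-right subtraction from the introduction fails. A minimal numerical sketch (illustrative values, not from the paper):

```python
import numpy as np

fs = 8000                                    # sample rate, illustrative
t = np.arange(fs) / fs
vocal = np.sin(2 * np.pi * 220.0 * t)        # stand-in for the lead vocal

# Studio case: mono vocal panned dead center, identical in both channels.
studio_residual = vocal - vocal              # L - R cancels exactly

# Live case: the right mic picks up the vocal with a different amplitude
# and delay, so L - R leaves a strong residual.
delay = 5                                    # samples of inter-mic delay
right = 0.8 * np.concatenate([np.zeros(delay), vocal[:-delay]])
live_residual = vocal - right

print(np.max(np.abs(studio_residual)))       # 0.0
print(np.std(live_residual) / np.std(vocal)) # a large fraction of the vocal survives
```

A frequency-dependent version of this same mismatch is what the transfer-function model of Section 2.2 captures.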

Figure 1. System diagram of our vocal removal design.

2.2. Source separation model with transfer function

A typical model for blind source separation of an N-channel sensor signal x(t) arising from M unknown scalar source signals s(t), in the case of an instantaneous mixture, is described by

x(t) = A s(t) + v(t)    (1)

where

x(t) = [x_1(t), \ldots, x_N(t)]^T, \quad s(t) = [s_1(t), \ldots, s_M(t)]^T    (2)

A is an N \times M linear mixing matrix, and v(t) is zero-mean, white additive noise. In a convolutive environment such as the recording-session situation, signals at different locations have different transfer functions:

x_n(t) = \sum_{m=1}^{M} \int a_{nm}(\tau)\, s_m(t - \tau)\, d\tau + v_n(t), \quad n = 1, \ldots, N    (3)

where a_{nm}(\tau) is the impulse response of the transfer function from the mth source signal to the nth microphone. The Short Time Fourier Transform (STFT) of (3) turns this into an instantaneous mixture separately for every frequency, giving

X_n(t, \omega) = \sum_{m=1}^{M} A_{nm}(\omega) S_m(t, \omega) + V_n(t, \omega), \quad n = 1, \ldots, N    (4)

where X_n(t, \omega), S_m(t, \omega) and V_n(t, \omega) are the STFTs of x_n(t), s_m(t) and v_n(t), respectively. Here t denotes the STFT window position. The temporal transfer function from the mth source signal to microphone n is defined as

A_{nm}(\omega) = \int a_{nm}(\tau) e^{-j\omega\tau}\, d\tau = \hat{a}_{nm}(\omega)\, e^{-j\omega \hat{\delta}_{nm}(\omega)}    (5)

where we define

\hat{a}_{nm}(\omega) = |A_{nm}(\omega)|, \quad \hat{\delta}_{nm}(\omega) = -\angle A_{nm}(\omega) / \omega    (6)

In matrix notation, the model (1) can be written as

X(t, \omega) = A(\omega) S(t, \omega) + V(t, \omega)    (7)

In our case N = 2, so that (7) in detail looks like

\begin{bmatrix} X_1(t,\omega) \\ X_2(t,\omega) \end{bmatrix} =
\begin{bmatrix} \hat{a}_{11}(\omega) e^{-j\omega\hat{\delta}_{11}(\omega)} & \cdots & \hat{a}_{1M}(\omega) e^{-j\omega\hat{\delta}_{1M}(\omega)} \\
\hat{a}_{21}(\omega) e^{-j\omega\hat{\delta}_{21}(\omega)} & \cdots & \hat{a}_{2M}(\omega) e^{-j\omega\hat{\delta}_{2M}(\omega)} \end{bmatrix}
\begin{bmatrix} S_1(t,\omega) \\ \vdots \\ S_M(t,\omega) \end{bmatrix} +
\begin{bmatrix} V_1(t,\omega) \\ V_2(t,\omega) \end{bmatrix}    (8)

Without loss of generality, we can absorb the attenuation and delay parameters for each frequency bin in the first microphone signal x_1(t) into the definition of the sources. In this way, (8) can be rewritten as

\begin{bmatrix} X_1(t,\omega) \\ X_2(t,\omega) \end{bmatrix} =
\begin{bmatrix} 1 & \cdots & 1 \\
a_1(\omega) e^{-j\omega\delta_1(\omega)} & \cdots & a_M(\omega) e^{-j\omega\delta_M(\omega)} \end{bmatrix}
\begin{bmatrix} S_1(t,\omega) \\ \vdots \\ S_M(t,\omega) \end{bmatrix} +
\begin{bmatrix} V_1(t,\omega) \\ V_2(t,\omega) \end{bmatrix}    (9)

In (9), suppose the vocal component is the kth source, associated with the STFT S_k(t, \omega). If we can estimate the kth column vector [1, a_k(\omega) e^{-j\omega\delta_k(\omega)}]^T of A, then left-multiplying by a vector that is orthogonal to it will completely remove the kth source, achieving the initial goal.
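This single-bin cancellation argument can be checked numerically. The sketch below (illustrative values, not the paper's code) builds the two-channel model of (9) for three sources at one frequency and projects onto a vector orthogonal to the vocal's column:

```python
import numpy as np

rng = np.random.default_rng(1)
omega = 2 * np.pi * 1000.0                   # one STFT bin (rad/s), illustrative
a_k, delta_k = 0.8, 3e-4                     # vocal attenuation and delay at this bin

# Mixing matrix of model (9): source 0 plays the role of the vocal.
g = np.exp(-1j * omega * delta_k)
A = np.array([[1.0,     1.0, 1.0],
              [a_k * g, 0.5, 1.2]])

# 50 STFT frames of 3 complex source signals, then the stereo mixture.
S = rng.standard_normal((3, 50)) + 1j * rng.standard_normal((3, 50))
X = A @ S

# Row vector orthogonal to the vocal's column [1, a_k * g]^T.
A_perp = np.array([-a_k * g, 1.0])

Y = A_perp @ X                               # mono, vocal-suppressed frames
print(abs(A_perp @ A[:, 0]))                 # vocal's gain in Y: 0 up to rounding
```

Because the projection zeroes the vocal's column exactly, the mono frames Y contain only the remaining sources (plus, in practice, noise and estimation error).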

2.3. Estimating the vocal component location

With solo segments of the vocal extracted and their STFT time-frequency cells available, it is possible to estimate the unstructured spatial transfer function for each frequency [2]. A variety of methods for estimating the transfer function are being explored. Considering equation (9) in the case of a single source allows estimation of the transfer function parameters by division of the STFTs of the left and right channels, giving

a_k(\omega) = \left| \frac{X_2(\omega)}{X_1(\omega)} \right|    (10)

and

\delta_k(\omega) = -\frac{1}{\omega} \Im\left( \log \frac{X_2(\omega)}{X_1(\omega)} \right)    (11)

This simple method of STFT division [3] is not robust to noise and gives unsatisfactory results. Here we describe in detail an autocorrelation method that is robust to uncorrelated additive noise. The autocorrelation matrix of (7) at a given time-frequency cell is obtained by averaging the following equation:

X(t,\omega) X(t,\omega)^H = A(\omega) S(t,\omega) S(t,\omega)^H A(\omega)^H + A(\omega) S(t,\omega) V(t,\omega)^H + V(t,\omega) S(t,\omega)^H A(\omega)^H + V(t,\omega) V(t,\omega)^H    (12)

We have assumed that signals and noise are uncorrelated. Denoting by R_x, R_s and R_v the correlation matrices of the microphone signals, source signals and noise respectively, and considering the signals stationary, after averaging over a time window we obtain

R_x(\omega) = A(\omega) R_s(\omega) A(\omega)^H + R_v(\omega)    (13)

Here we assume that the noise is white, so that R_v(\omega) is diagonal and R_v(\omega) = \sigma_v^2 I_N for all \omega, where I_N is the identity matrix of size N. In segments where only one component s_k exists, (13) becomes

R_x(\omega) = \sigma_{s_k}^2 \tilde{A}_k(\omega) \tilde{A}_k(\omega)^H + \sigma_v^2 I_N    (14)

where

\tilde{A}_k(\omega) = \begin{bmatrix} 1 \\ a_k(\omega) e^{-j\omega\delta_k(\omega)} \end{bmatrix}    (15)

Right-multiplying (14) by \tilde{A}_k(\omega), we get

R_x(\omega) \tilde{A}_k(\omega) = \left( \sigma_{s_k}^2 \tilde{A}_k(\omega)^H \tilde{A}_k(\omega) + \sigma_v^2 \right) \tilde{A}_k(\omega)    (16)

This means that \tilde{A}_k(\omega) can be estimated from the eigenvectors of R_x(\omega), whose signal part is a rank-1 matrix; it is the eigenvector associated with the largest eigenvalue,

\lambda = \sigma_{s_k}^2 \tilde{A}_k(\omega)^H \tilde{A}_k(\omega) + \sigma_v^2    (17)

One can note that in principle the transfer function can be estimated by dividing the two components of the eigenvector. As will be discussed later, an advantage of our method is that it is robust to added noise. The correlation method can easily be extended to higher-order statistics [4] by generalizing equation (12). For example, in the case of 4th-order cumulants, we construct the matrix X(\omega) X^3(\omega)^H - 3 X(\omega) X(\omega)^H. It can be shown that for a Gaussian signal this matrix equals zero, since the 4th cumulant of a Gaussian signal equals three times the 2nd cumulant (the correlation). This effectively eliminates the additive noise matrix from equation (13). The eigenvectors of the resulting matrix are the same as in the correlation case.

2.4. Vocal removal

After obtaining \tilde{A}_k(\omega), we find a vector orthogonal to it,

\tilde{A}_k^\perp = \begin{bmatrix} -a_k(\omega) e^{-j\omega\delta_k(\omega)} & 1 \end{bmatrix}    (18)

and then left-multiply the microphone signal in (9) by \tilde{A}_k^\perp to remove the kth component:

\tilde{A}_k^\perp \begin{bmatrix} X_1(t,\omega) \\ X_2(t,\omega) \end{bmatrix} =
\begin{bmatrix} \tilde{A}_k^\perp \tilde{A}_1 & \cdots & 0\ (k\text{th}) & \cdots & \tilde{A}_k^\perp \tilde{A}_M \end{bmatrix}
\begin{bmatrix} S_1(t,\omega) \\ \vdots \\ S_M(t,\omega) \end{bmatrix} +
\tilde{A}_k^\perp \begin{bmatrix} V_1(t,\omega) \\ V_2(t,\omega) \end{bmatrix}    (19)

where \tilde{A}_m denotes the mth column of the mixing matrix in (9). A special case is when only two components exist in the stereo recording, in which case we can extract both sources individually. Both sides of (19) are one-dimensional, so the resulting sound is mono.

2.5. Solo segment extraction

As the starting point of the estimation, solo segments in a song carry all the information about the unknown transfer functions. In order to use the algorithm described in 2.2, it is important to have a good extraction algorithm for the solo segments. Several approaches are available. We can use the Gaussian Mixture Model in [2] to find time-frequency cells with only one source, or adopt the W-disjoint assumption of [3] and look for clusters near the center of the stereo field. Voice activity detection (VAD) algorithms can be utilized as well [5]. Solo segment extraction is currently performed manually.

2.6. Delay refinement and bass emphasis

Estimation accuracy starts to decrease in high-frequency bins, as the wavelength of the signal becomes so short that the transfer functions become sensitive to movements of the singer or instrument performer. This results in some residual transients (consonant attacks) of the vocal in the processed sound.
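This sensitivity to source movement grows with frequency: a fixed timing error \Delta in the estimated transfer function leaves a residual gain of |1 - e^{-j\omega\Delta}| after cancellation. A small numerical sketch (illustrative values, not from the paper):

```python
import numpy as np

# Residual gain left by a delay estimate that is off by dt: the same
# 50 us of source movement barely matters at 200 Hz but nearly destroys
# cancellation at 5 kHz.
dt = 5e-5                                     # 50 microseconds of delay error
residuals = {}
for f in (200.0, 1000.0, 5000.0):
    omega = 2 * np.pi * f
    residuals[f] = abs(1.0 - np.exp(-1j * omega * dt))
print(residuals)                              # grows monotonically with frequency
```

At 5 kHz the same timing error leaves a residual larger than the source itself, which is why high-frequency consonant transients survive and why the delay is refined per frame.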

A method for delay refinement is used on a frame-by-frame basis to search for an optimal delay correction accounting for minor movements of the sound source. Searching over a range of possible delay parameters around the theoretical value, we look for the minimum energy of the resulting signal, which occurs when the source is most precisely removed.

Removing the vocal component in the manner described in this paper usually removes a certain amount of bass from the recording as well, since low-frequency instruments are often not very localized sources and their locations can coincide with the vocal's. On the other hand, the vocal usually does not occupy the same frequency band as the bass. The fundamental frequency of voiced speech varies with many factors such as emotion and age; the literature [6] gives a typical range of 85-155 Hz for adult males and 165-255 Hz for females. Taking this as a reference, we choose a cut-off frequency of 120 Hz, low-pass filter the original stereo material, and mix it with the mono vocal-free track. The filtered material contains little of the vocal while retaining most of the bass information and preserving the stereo field in the low frequencies. Simulated results show that this method is very effective in compensating the low frequencies, and it adds some stereo feeling to the accompaniment track.

3. IMPLEMENTATION AND PROTOTYPE

The system design is implemented in Matlab. Several CD recordings of different genres have been tested and compared with the existing method (left-right subtraction). We chose a Hamming window for the STFT and window lengths of 1024 and 2048. For songs with heavy post-production, where the vocal is recorded mono and panned to the center with little reverb, our system works similarly to the existing method of vocal removal, but the bass emphasis noticeably enhances the final sound quality. In recording situations closer to our initial assumptions in 2.1, for example a concert recording of Billie Holiday with a band, the existing method fails to remove the vocal at all, while our system remains as robust as in the previous situation. Some sound examples are available online.¹

4. DISCUSSION AND FUTURE RESEARCH

The prototype system turns out to function well, especially when the real situation is close to the assumptions. As can be seen from equation (16), added noise does not compromise the accuracy of estimating \tilde{A}_k(\omega) from the eigenvector of R_x(\omega), so the algorithm is robust in a noisy environment. The model allows different attenuations and delays at different frequencies through the assumed transfer function, so it is also robust to reverberant vocal sound, even artificial reverb, provided the reverb is linear. In a typical live recording session there are usually close mics for individual instruments, whose signals are later added to the stereo recording of the whole ensemble to boost those instruments. This is still a linear operation, which fits our assumptions, so the model should work well in theory. We have also considered delay refinement to compensate for small movements of the source. However, many more detailed properties of the acoustics are yet to be investigated, such as the geometrical conditions and the spatial sampling distance between the microphones. STFT window types and window sizes also matter. The window should at least be suitable for perfect reconstruction with a proper hop length, a Hamming window for example. The window length should be longer than the longest possible reverb time, so that the transfer function model holds for most of a window instead of letting too much of the convolution tails fall into neighboring windows. On the other hand, a longer window results in fewer STFT time slices for the autocorrelation time average. The choice of window length and shape should therefore weigh all these trade-offs. Furthermore, solo segment extraction algorithms are to be compared and tested to automate the extraction process, and delay refinement and transient removal algorithms are under development.

5. REFERENCES

[1] "... canceler with simulated stereo output", U.S. Pat. 05550920, 1996.

[2] Dubnov, S., Tabrikian, J., Arnon-Targan, M. "A Method for Directionally-Disjoint Source Separation in Convolutive Environment", ICASSP, Vol. 5, pp. 489-492, Montreal, Canada, 2004.

[3] Jourjine, A., Rickard, S., and Yilmaz, O. "Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2 Mixtures", ICASSP, Vol. 5, pp. 2985-2988, Istanbul, Turkey, June 2000.

[4] Mendel, J.M. "Tutorial on higher-order statistics (spectra) in signal processing and system theory: theoretical results and some applications", Proceedings of the IEEE, Vol. 79, Issue 3, pp. 278-305, March 1991.

[5] Fisher, E., Tabrikian, J., Dubnov, S. "Generalized likelihood ratio test for voiced/unvoiced decision using the harmonic plus noise model", ICASSP, Vol. 1, pp. 440-443, 2003.

[6] Baken, R. J. Clinical Measurement of Speech and Voice. London: Taylor and Francis Ltd., 1987.

¹ http://crca.ucsd.edu/~pxiang/research/vocalremoval.html
