Sample Selection Bias Correction Theory

Corinna Cortes¹, Mehryar Mohri²,¹, Michael Riley¹, and Afshin Rostamizadeh²

¹ Google Research, 76 Ninth Avenue, New York, NY 10011.
² Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.
Abstract. This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability, which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.
1 Introduction
In the standard formulation of machine learning problems, the learning algorithm receives training and test samples drawn according to the same distribution. However, this assumption often does not hold in practice. The training sample available is biased in some way, which may be due to a variety of practical reasons such as the cost of data labeling or acquisition. The problem occurs in many areas such as astronomy, econometrics, and species habitat modeling.

In a common instance of this problem, points are drawn according to the test distribution but not all of them are made available to the learner. This is called the sample selection bias problem. Remarkably, it is often possible to correct this bias by using large amounts of unlabeled data.

The problem of sample selection bias correction for linear regression has been extensively studied in econometrics and statistics (Heckman, 1979; Little & Rubin, 1986) with the pioneering work of Heckman (1979). Several recent machine learning publications (Elkan, 2001; Zadrozny, 2004; Zadrozny et al., 2003; Fan et al., 2005; Dudík et al., 2006) have also dealt with this problem. The main correction technique used in all of these publications consists of reweighting the cost of training point errors to more closely reflect that of the test distribution. This is in fact a technique commonly used in statistics and machine learning for a variety of problems of this type (Little & Rubin, 1986). With the exact weights, this reweighting could optimally correct the bias but, in practice, the weights are based on an estimate of the sampling probability from finite data sets. Thus, it is important to determine to what extent the error in this estimation can affect the accuracy of the hypothesis returned by the learning algorithm. To our knowledge, this problem has not been analyzed in a general manner.
This paper gives a theoretical analysis of sample selection bias correction. Our analysis is based on the novel concept of distributional stability, which generalizes the point-based stability introduced and analyzed by previous authors (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002). We show that large families of learning algorithms, including all kernel-based regularization algorithms such as Support Vector Regression (SVR) (Vapnik, 1998) or kernel ridge regression (Saunders et al., 1998), are distributionally stable, and we give the expression of their stability coefficient for both the l_1 and l_2 distances.

We then analyze two commonly used sample bias correction techniques: a cluster-based estimation technique and kernel mean matching (KMM) (Huang et al., 2006b). For each of these techniques, we derive bounds on the difference of the error rate of the hypothesis returned by a distributionally stable algorithm when using that estimation technique versus using perfect reweighting. We briefly discuss and compare these bounds and also report the results of experiments with both estimation techniques for several publicly available machine learning data sets. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when used in combination with a distributionally stable algorithm.

The remaining sections of this paper are organized as follows. Section 2 describes in detail the sample selection bias correction technique. Section 3 introduces the concept of distributional stability and proves the distributional stability of kernel-based regularization algorithms. Section 4 analyzes the effect of estimation error when using distributionally stable algorithms for both the cluster-based and the KMM estimation techniques. Section 5 reports the results of experiments with several data sets comparing these estimation techniques.
2 Sample Selection Bias Correction

2.1 Problem
Let X denote the input space and Y the label set, which may be {0, 1} in classification or any measurable subset of R in regression estimation problems, and let D denote the true distribution over X × Y according to which test points are drawn. In the sample selection bias problem, some pairs z = (x, y) drawn according to D are not made available to the learning algorithm. The learning algorithm receives a training sample S of m labeled points z_1, ..., z_m drawn according to a biased distribution D' over X × Y. This sample bias can be represented by a random variable s taking values in {0, 1}: when s = 1 the point is sampled, otherwise it is not. Thus, by definition of the sample selection bias, the support of the biased distribution D' is included in that of the true distribution D.

As in standard learning scenarios, the objective of the learning algorithm is to select a hypothesis h out of a hypothesis set H with a small generalization error R(h) with respect to the true distribution D. The sample bias correction techniques studied here assume access to large amounts of unlabeled data drawn according to the true distribution D, which is often available in practice. Thus, in what follows, let U be a sample drawn according to D and S ⊆ U a labeled but biased sub-sample.
2.2 Weighted Samples
A weighted sample S_w is a training sample S of m labeled points, z_1, ..., z_m, drawn i.i.d. from X × Y, that is augmented with a non-negative weight w_i ≥ 0 for each point z_i. This weight is used to emphasize or de-emphasize the cost of an error on z_i, as in so-called importance weighting or cost-sensitive learning (Elkan, 2001; Zadrozny et al., 2003). One could use the weights w_i to derive an equivalent but larger unweighted sample S' where the multiplicity of z_i would reflect its weight w_i, but most learning algorithms, e.g., decision trees, logistic regression, AdaBoost, Support Vector Machines (SVMs), kernel ridge regression, can directly accept a weighted sample S_w. We will refer to algorithms that can directly take S_w as input as weight-sensitive algorithms.
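To make the use of a weight-sensitive algorithm concrete, here is a minimal sketch (ours, not from the original paper), assuming scikit-learn, whose SVR estimator accepts a per-point sample_weight argument at fitting time; the data arrays are hypothetical placeholders.

```python
# Sketch: training a weight-sensitive algorithm on a weighted sample S_w.
# Assumes scikit-learn; X_train, y_train, and the weights w are placeholders.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))       # hypothetical biased sample S
y_train = X_train @ rng.normal(size=5)    # hypothetical labels
w = rng.uniform(0.5, 2.0, size=100)       # non-negative weights w_i

# SVR is weight-sensitive: the cost of an error on z_i is scaled by w_i.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train, sample_weight=w)
```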
The empirical error of a hypothesis h on a weighted sample S_w is defined as

\[ \widehat{R}_w(h) = \frac{1}{m} \sum_{i=1}^{m} w_i\, c(h, z_i). \qquad (1) \]

Proposition 1. Let D' be a distribution whose support coincides with that of D and let S_w be a weighted sample with w_i = Pr_D(z_i)/Pr_{D'}(z_i) for all points z_i in S. Then,

\[ \operatorname*{E}_{S \sim D'}\big[\widehat{R}_w(h)\big] = R(h) = \operatorname*{E}_{z \sim D}[c(h, z)]. \]

Proof. Since the sample points are drawn i.i.d.,

\[ \operatorname*{E}_{S \sim D'}\big[\widehat{R}_w(h)\big] = \frac{1}{m} \sum_{i=1}^{m} \operatorname*{E}_{S \sim D'}[w_i\, c(h, z_i)] = \operatorname*{E}_{z \sim D'}[w\, c(h, z)]. \qquad (2) \]

By definition of the weights w and the fact that the supports of D and D' coincide, this last expectation can be rewritten as follows:

\[ \operatorname*{E}_{z \sim D'}[w\, c(h, z)] = \sum_{z} \Pr_{D'}[z]\, \frac{\Pr_D[z]}{\Pr_{D'}[z]}\, c(h, z) = \sum_{z} \Pr_D[z]\, c(h, z). \qquad (3) \]

This last term is, by definition, the generalization error R(h). ∎
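As a quick numerical illustration of Proposition 1 (our own sketch, not part of the original paper), the following simulates a biased distribution D' over a small discrete set and checks that the weighted empirical error with w_i = Pr_D(z_i)/Pr_{D'}(z_i) matches the generalization error under D on average.

```python
# Sketch: empirical check that E_{S~D'}[R_w(h)] = R(h) on a discrete toy set.
import numpy as np

rng = np.random.default_rng(1)
z = np.arange(5)                            # five possible points z
p_true = np.array([.1, .2, .3, .25, .15])   # Pr_D
p_bias = np.array([.3, .3, .2, .1, .1])     # Pr_D', same support
cost = np.array([0.0, 1.0, 0.5, 0.2, 0.8])  # c(h, z) for a fixed h

R_true = np.dot(p_true, cost)               # generalization error R(h)
w = p_true / p_bias                         # importance weights

m, trials = 200, 2000
est = np.mean([
    np.mean(w[s] * cost[s])                 # weighted empirical error
    for s in (rng.choice(z, size=m, p=p_bias) for _ in range(trials))
])
print(R_true, est)                          # the two values should be close
```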
2.3 Bias Correction

The probability of drawing z = (x, y) according to the true but unobserved distribution D can be straightforwardly related to the observed distribution D'. By definition of the random variable s, the observed biased distribution D' can be expressed by Pr_{D'}[z] = Pr_D[z | s = 1]. We will assume that all points z in the support of D can be sampled with a non-zero probability, so the supports of D and D' coincide. Thus, for all z ∈ X × Y, Pr[s = 1|z] ≠ 0. Then, by the Bayes formula, for all z in the support of D,

\[ \Pr_{D'}[z] = \Pr_D[z \mid s = 1] = \frac{\Pr[s = 1 \mid z]\, \Pr_D[z]}{\Pr[s = 1]}, \qquad (4) \]

which yields

\[ \Pr_D[z] = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid z]}\, \Pr_{D'}[z]. \qquad (5) \]

Thus, if the probabilities Pr[s = 1] and Pr[s = 1|z] were known exactly, the true distribution Pr_D could be reconstituted exactly from the biased one Pr_{D'}.
It is important to note that this correction is only needed for the training sample S, since it is the only source of selection bias. With a weight-sensitive algorithm, it suffices to reweight each training point z_i with the weight w_i = Pr[s = 1]/Pr[s = 1|z_i]. Thus, Pr[s = 1|z] need not be estimated for all points z but only for the training points; by Proposition 1, the standard guarantees relating the empirical error R̂(h) and the generalization error R(h) then apply.

When the sampling probability is independent of the labels, as it is commonly assumed in many settings (Zadrozny, 2004; Zadrozny et al., 2003), Pr[s = 1|z] = Pr[s = 1|x], and Equation 5 can be rewritten as

\[ \Pr_D[z] = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid x]}\, \Pr_{D'}[z]. \qquad (6) \]

In that case, the probabilities Pr[s = 1] and Pr[s = 1|x] needed to reconstitute Pr_D from Pr_{D'} do not depend on the labels and thus can be estimated using the unlabeled points in U. Moreover, as already mentioned, for weight-sensitive algorithms, it suffices to estimate Pr[s = 1|x_i] for the points x_i of the training data; no generalization is needed.
A simple case is when the points are defined over a discrete set (this can, for instance, be the result of a quantization or clustering technique, as discussed later). Pr[s = 1|x] can then be estimated from the frequency m_x/n_x, where m_x denotes the number of times x appeared in S ⊆ U and n_x the number of times x appeared in the full data set U. Pr[s = 1] can be estimated by the quantity |S|/|U|. However, since Pr[s = 1] is a constant independent of x, its estimation is not even necessary.

If the estimation of the sampling probability Pr[s = 1|x] from the unlabeled data set U were exact, then the reweighting just discussed could correct the sample bias optimally. Several techniques have been commonly used to estimate the reweighting quantities. But these estimated weights are not guaranteed to be exact. The next section addresses how the error in that estimation affects the error rate of the hypothesis returned by the learning algorithm.
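The frequency-based estimation just described is straightforward to implement. Below is a small sketch (ours, with hypothetical array contents): U is the full unlabeled sample and S the biased training sub-sample, both over a discrete set.

```python
# Sketch: estimate Pr[s=1|x] by m_x/n_x over a discrete set and form weights.
from collections import Counter
import numpy as np

U = np.array([0, 1, 1, 2, 2, 2, 3, 3, 0, 1])   # hypothetical full sample U
S = np.array([1, 2, 2, 3])                     # hypothetical biased S ⊆ U

n_x = Counter(U.tolist())                      # n_x: counts in U
m_x = Counter(S.tolist())                      # m_x: counts in S

# Estimated sampling probability and correction weights
# w_i ∝ 1 / Pr[s=1|x_i]; the constant factor Pr[s=1] can be dropped.
p_hat = {x: m_x[x] / n_x[x] for x in m_x}
weights = np.array([1.0 / p_hat[x] for x in S])
print(p_hat, weights)
```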
3 Distributional Stability

Here, we will examine the effect on the error of the hypothesis returned by the learning algorithm in response to a change in the way the training points are weighted. Since the weights are non-negative, we can assume that they are normalized and define a distribution over the training sample. This study can be viewed as a generalization of stability analysis, where a single sample point is changed (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002), to the more general case of distributional stability, where the sample's weight distribution is changed.

Thus, in this section, the sample weight W of S_W defines a distribution over S. For a fixed learning algorithm L and a fixed sample S, we will denote by h_W the hypothesis returned by L for the weighted sample S_W. We will denote by d(W, W') a divergence measure for two distributions W and W'. There are many standard measures for the divergences or distances between two distributions, including the relative entropy, the Hellinger distance, and the l_p distance.
Definition 1 (Distributional β-Stability). A learning algorithm L is said to be distributionally β-stable for the divergence measure d if for any two weighted samples S_W and S_W',

\[ \forall z \in X \times Y, \quad |c(h_W, z) - c(h_{W'}, z)| \le \beta\, d(W, W'). \qquad (7) \]

Proposition 2. Let L be a distributionally β-stable algorithm and let h_W (resp. h_W') denote the hypothesis returned by L when trained on the weighted sample S_W (resp. S_W'). Let W_T denote the distribution according to which test points are drawn. Then, the following holds:

\[ |R(h_W) - R(h_{W'})| \le \beta\, d(W, W'). \qquad (8) \]

Proof. By the distributional stability of the algorithm,

\[ |R(h_W) - R(h_{W'})| = \Big| \operatorname*{E}_{z \sim W_T}\big[c(h_W, z) - c(h_{W'}, z)\big] \Big| \le \beta\, d(W, W'), \qquad (9) \]

which implies the statement of the proposition. ∎
3.1 Distributional Stability of Kernel-Based Regularization Algorithms

Here, we show that kernel-based regularization algorithms are distributionally stable. This family of algorithms includes, among others, Support Vector Regression (SVR) and kernel ridge regression. Other algorithms, such as those based on relative entropy regularization, can be shown to be distributionally stable in a way similar to the point-based stability case. Our results also apply to classification algorithms such as Support Vector Machines (SVM) (Cortes & Vapnik, 1995) using a margin-based loss function l, as in (Bousquet & Elisseeff, 2002).
We will assume that the cost function c is σ-admissible, that is, there exists σ ∈ R+ such that for any two hypotheses h, h' ∈ H and for all z = (x, y) ∈ X × Y,

\[ |c(h', z) - c(h, z)| \le \sigma\, |h'(x) - h(x)|. \qquad (10) \]

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some M ∈ R+: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will also assume that c is differentiable. This assumption is in fact not necessary and all of our results hold without it, but it makes the presentation simpler.
Let N: H → R+ be a function defined over the hypothesis set. Regularization-based algorithms minimize an objective of the form

\[ F_W(h) = \widehat{R}_W(h) + \lambda N(h), \qquad (11) \]

where λ ≥ 0 is a trade-off parameter. We denote by B_F the Bregman divergence associated to a convex function F, B_F(f‖g) = F(f) − F(g) − ⟨f − g, ∇F(g)⟩, and define Δh as Δh = h' − h.

Lemma 1. Let the hypothesis set H be a vector space. Assume that N is a proper closed convex function and that F_W and F_W' admit the minimizers h ∈ H and h' ∈ H, respectively. Then, the following bound holds:

\[ B_N(h' \,\|\, h) + B_N(h \,\|\, h') \le \frac{\sigma}{\lambda}\, l_1(W, W') \sup_{x \in S} |\Delta h(x)|. \qquad (12) \]
Proof. Since B_{F_W} = B_{R̂_W} + λB_N and B_{F_W'} = B_{R̂_{W'}} + λB_N, and a Bregman divergence is non-negative,

\[ \lambda \big[ B_N(h' \,\|\, h) + B_N(h \,\|\, h') \big] \le B_{F_W}(h' \,\|\, h) + B_{F_{W'}}(h \,\|\, h'). \qquad (13) \]

By the definition of h and h' as the minimizers of F_W and F_W', the gradients ∇F_W(h) and ∇F_W'(h') vanish, and thus

\[ B_{F_W}(h' \,\|\, h) + B_{F_{W'}}(h \,\|\, h') = \widehat{R}_W(h') - \widehat{R}_W(h) + \widehat{R}_{W'}(h) - \widehat{R}_{W'}(h'). \]

Thus, by the σ-admissibility of the cost function c, using the notation W_i = W(x_i) and W'_i = W'(x_i),

\[ B_{F_W}(h' \,\|\, h) + B_{F_{W'}}(h \,\|\, h') = \sum_{i=1}^{m} (W_i - W'_i)\big[ c(h', z_i) - c(h, z_i) \big] \le \sum_{i=1}^{m} \sigma\, |W_i - W'_i|\, |\Delta h(x_i)| \le \sigma\, l_1(W, W') \sup_{x \in S} |\Delta h(x)|, \qquad (14) \]

which establishes the lemma. ∎
Given x_1, ..., x_m ∈ X and a positive definite symmetric (PDS) kernel K, we denote by K ∈ R^{m×m} the kernel matrix defined by K_{ij} = K(x_i, x_j), and by λ_max(K) the largest eigenvalue of K.
Lemma 2. Let H be a reproducing kernel Hilbert space with kernel K and let the regularization function N be defined by N(·) = ‖·‖²_K. Then, the following bound holds:

\[ B_N(h' \,\|\, h) + B_N(h \,\|\, h') \le \frac{\sigma}{\lambda}\, \lambda_{\max}^{1/2}(\mathbf{K})\, l_2(W, W')\, \|\Delta h\|_K. \qquad (15) \]
Proof. As in the proof of Lemma 1,

\[ \lambda \big[ B_N(h' \,\|\, h) + B_N(h \,\|\, h') \big] \le \sigma \sum_{i=1}^{m} |\Delta h(x_i)|\, |\Delta W_i|, \qquad (16) \]

where ΔW_i = W'(x_i) − W(x_i). By definition of a reproducing kernel Hilbert space H, for any hypothesis h ∈ H and any x ∈ X, h(x) = ⟨h, K(x, ·)⟩, and thus this also holds for any Δh = h' − h with h, h' ∈ H. Let ΔW denote the vector whose components are the ΔW_i's, let V denote B_N(h'‖h) + B_N(h‖h'), and let s_i ∈ {−1, +1} denote the sign of Δh(x_i)ΔW_i. Using σ-admissibility,

\[ \lambda V \le \sigma \sum_{i=1}^{m} |\Delta h(x_i)\, \Delta W_i| = \sigma\, \Big| \Big\langle \Delta h,\; \sum_{i=1}^{m} s_i\, \Delta W_i\, K(x_i, \cdot) \Big\rangle \Big|. \]

Then,

\[ \lambda V \le \sigma\, \|\Delta h\|_K \Big\| \sum_{i=1}^{m} s_i\, \Delta W_i\, K(x_i, \cdot) \Big\|_K = \sigma\, \|\Delta h\|_K \Big[ \sum_{i,j=1}^{m} s_i s_j\, \Delta W_i\, \Delta W_j\, K(x_i, x_j) \Big]^{1/2} \le \sigma\, \|\Delta h\|_K\, \|\Delta W\|_2\, \lambda_{\max}^{1/2}(\mathbf{K}). \qquad (17) \]

In this derivation, the second inequality follows from the Cauchy-Schwarz inequality, and the last inequality from the standard property of the Rayleigh quotient for PDS matrices. ∎
Theorem 1. Let K be a PDS kernel such that K(x, x) ≤ κ² < ∞ for all x ∈ X. Then, the regularization algorithm based on N(·) = ‖·‖²_K is distributionally β-stable for the l_1 distance with β ≤ σ²κ²/(2λ), and for the l_2 distance with β ≤ σ²κ λ_max^{1/2}(K)/(2λ).

Proof. For N(·) = ‖·‖²_K, we have B_N(h'‖h) = ‖Δh‖²_K, thus B_N(h'‖h) + B_N(h‖h') = 2‖Δh‖²_K, and by Lemma 1,

\[ 2\|\Delta h\|_K^2 \le \frac{\sigma}{\lambda}\, l_1(W, W') \sup_{x \in S} |\Delta h(x)| \le \frac{\sigma \kappa}{\lambda}\, l_1(W, W')\, \|\Delta h\|_K, \qquad (18) \]

since for all x, |Δh(x)| = |⟨Δh, K(x, ·)⟩| ≤ ‖Δh‖_K K(x, x)^{1/2} ≤ κ‖Δh‖_K. Thus,

\[ \|\Delta h\|_K \le \frac{\sigma \kappa\, l_1(W, W')}{2\lambda}. \qquad (19) \]

By the σ-admissibility of c, for all z = (x, y) ∈ X × Y,

\[ |c(h', z) - c(h, z)| \le \sigma\, |\Delta h(x)| \le \sigma \kappa\, \|\Delta h\|_K \le \frac{\sigma^2 \kappa^2\, l_1(W, W')}{2\lambda}, \qquad (20) \]

which shows the distributional stability of a kernel-based regularization algorithm for the l_1 distance. Using Lemma 2, a similar derivation leads to

\[ |c(h', z) - c(h, z)| \le \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})\, l_2(W, W')}{2\lambda}, \qquad (21) \]

which shows the distributional stability of a kernel-based regularization algorithm for the l_2 distance. ∎
Note that the standard setting of an unweighted sample coincides with that of a weighted sample with the uniform distribution W_U: each point receives weight 1/m. In the standard stability analysis, a single point of the sample is removed; if W_U' denotes the uniform distribution over the remaining m − 1 points, the corresponding distribution, then

\[ l_1(W_U, W_U') = \sum_{i=1}^{m-1} \Big| \frac{1}{m-1} - \frac{1}{m} \Big| + \frac{1}{m} = \frac{2}{m}. \qquad (22) \]

Thus, in the case of kernel-based regularized algorithms and for the l_1 distance, standard uniform stability is a special case of distributional stability. It can be shown similarly that l_2(W_U, W_U') = [m(m−1)]^{−1/2}.
4 Effect of Estimation Error for Kernel-Based Regularization Algorithms
This section analyzes the effect of an error in the estimation of the weight of a training example on the generalization error of the hypothesis h returned by a weight-sensitive learning algorithm. We will examine two estimation techniques: a straightforward histogram-based or cluster-based method, and kernel mean matching (KMM) (Huang et al., 2006b).
4.1 Cluster-Based Estimation
Following the discussion of Section 2.3, the sampling probability of a point x defined over a discrete set can be estimated by the frequency m_x/n_x. Let n = |U| and, for each distinct point x, let n_x denote the number of times x appears in U and m_x the number of times it appears in S. Furthermore, let m' denote the number of unique points in the sample S, and let p_0 = min_{x∈S} Pr_D[x] denote the smallest probability of a training point.

Lemma 3. Let δ > 0. Then, with probability at least 1 − δ, the following inequality holds for all x in S:

\[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \le \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n}}. \qquad (23) \]

Proof. For a fixed x ∈ U, by Hoeffding's inequality,

\[ \Pr\Big[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \ge \epsilon \Big] = \sum_{i=1}^{n} \Pr\Big[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{i} \Big| \ge \epsilon \;\Big|\; n_x = i \Big] \Pr[n_x = i] \le \sum_{i=1}^{n} 2 e^{-2 i \epsilon^2} \Pr[n_x = i]. \]

Since n_x is a binomial random variable with parameters Pr_D[x] = p_x and n, this last term can be expressed more explicitly and bounded as follows:

\[ 2 \sum_{i=1}^{n} e^{-2 i \epsilon^2} \Pr[n_x = i] \le 2 \sum_{i=0}^{n} e^{-2 i \epsilon^2} \binom{n}{i} p_x^i (1 - p_x)^{n-i} = 2 \big( p_x e^{-2\epsilon^2} + (1 - p_x) \big)^n = 2 \big( 1 - p_x (1 - e^{-2\epsilon^2}) \big)^n \le 2 \exp\big( -p_x n (1 - e^{-2\epsilon^2}) \big). \]

Since for x ∈ [0, 1], 1 − e^{−x} ≥ x/2, this shows that for ϵ ∈ [0, 1],

\[ \Pr\Big[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \ge \epsilon \Big] \le 2 e^{-p_x n \epsilon^2} \le 2 e^{-p_0 n \epsilon^2}. \qquad (24) \]

By the union bound and the definition of p_0,

\[ \Pr\Big[ \exists x \in S : \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \ge \epsilon \Big] \le 2 m' e^{-p_0 n \epsilon^2}. \]

Setting δ to match this upper bound yields the statement of the lemma. ∎
The following proposition bounds the distance between the distribution W corresponding to a perfectly reweighted sample S_W and the distribution Ŵ corresponding to a sample reweighted according to the frequency estimates.

Proposition 3. For a distinct point x equal to the sampled point x_i, define p(x_i) = Pr[s = 1|x] and p̂(x_i) = m_x/n_x, and let

\[ W(x_i) = \frac{1}{m}\, \frac{1}{p(x_i)} \quad \text{and} \quad \widehat{W}(x_i) = \frac{1}{m}\, \frac{1}{\hat{p}(x_i)}. \qquad (25) \]

Let B = max_{i=1,...,m} max(1/p(x_i), 1/p̂(x_i)). Then, for any δ > 0, with probability at least 1 − δ, the l_1 and l_2 distances of the distributions W and Ŵ can be bounded as follows:

\[ l_1(W, \widehat{W}) \le B^2 \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n}} \quad \text{and} \quad l_2(W, \widehat{W}) \le B^2 \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n\, m}}. \qquad (26) \]
Proof. By definition of the l_2 distance,

\[ l_2^2(W, \widehat{W}) = \sum_{i=1}^{m} \frac{1}{m^2} \Big( \frac{1}{p(x_i)} - \frac{1}{\hat{p}(x_i)} \Big)^2 = \frac{1}{m^2} \sum_{i=1}^{m} \Big( \frac{\hat{p}(x_i) - p(x_i)}{p(x_i)\, \hat{p}(x_i)} \Big)^2 \le \frac{1}{m}\, B^4 \max_i \big( p(x_i) - \hat{p}(x_i) \big)^2. \]

It can be shown similarly that l_1(W, Ŵ) ≤ B² max_i |p(x_i) − p̂(x_i)|. The application of the uniform convergence bound of Lemma 3 directly yields the statement of the proposition. ∎
The following theorem provides a bound on the difference between the generalization error of the hypothesis returned by a kernel-based regularization algorithm when trained on the perfectly unbiased distribution, and the one trained on the sample bias-corrected using frequency estimates.

Theorem 2. Let K be a PDS kernel such that K(x, x) ≤ κ² < ∞ for all x ∈ X. Let h_W be the hypothesis returned by the regularization algorithm based on N(·) = ‖·‖²_K using S_W, and h_Ŵ the one returned after training the same algorithm on S_Ŵ. Then, for any δ > 0, with probability at least 1 − δ, the difference in generalization error of these hypotheses is bounded as follows:

\[ |R(h_W) - R(h_{\widehat{W}})| \le \frac{\sigma^2 \kappa^2 B^2}{2\lambda} \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n}}, \qquad |R(h_W) - R(h_{\widehat{W}})| \le \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})\, B^2}{2\lambda} \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n\, m}}. \qquad (27) \]

Proof. The result follows from Proposition 2, the distributional stability of kernel-based regularization algorithms together with the bounds on their stability coefficient (Theorem 1), and the bounds of Proposition 3 on the l_1 and l_2 distances between the correct distribution W and the estimate Ŵ. ∎
Thus, the convergence bounds based on frequency estimates go to zero as p_0 n → ∞. In practice, m' ≪ m: the number of distinct points in S is small and, a fortiori, log 2m' is very small; the convergence rate therefore depends essentially on the rate at which p_0 n increases. Additionally, if λ_max(K) ≤ m (such as with Gaussian kernels), the l_2-based bound will provide convergence that is at least as fast.
4.2 Kernel Mean Matching
The following definitions, introduced by Steinwart (2002), will be needed for the presentation and discussion of the kernel mean matching (KMM) technique. Let X be a compact metric space and let C(X) denote the space of all continuous functions over X equipped with the standard infinite norm ‖·‖_∞. Let K: X × X → R be a PDS kernel. There exists a Hilbert space F and a map Φ: X → F such that for all x, y ∈ X, K(x, y) = ⟨Φ(x), Φ(y)⟩. Note that for a given kernel K, F and Φ are not unique and that, for these definitions, F does not need to be a reproducing kernel Hilbert space (RKHS). The kernel K is said to be universal if it is continuous and if the RKHS it induces is dense in C(X) for the norm ‖·‖_∞.

Let P denote the set of all probability distributions over X and let Ψ: P → F be the function defined by

\[ \forall p \in P, \quad \Psi(p) = \operatorname*{E}_{x \sim p}[\Phi(x)]. \qquad (28) \]
Theorem 3 (Huang et al. (2006a)). Let F be a separable Hilbert space and let K be a universal kernel with feature space F and feature map Φ: X → F. Then, Ψ is injective.

Proof. We give a full proof of this theorem, the main result supporting the KMM technique, in the Appendix. The proof given by Huang et al. (2006a) does not seem to be complete.
The KMM technique is applicable when the learning algorithm is based on a universal kernel. The theorem shows that for a universal kernel, the expected value of the feature vectors induced uniquely determines the probability distribution. KMM uses this property to reweight training points so that the average value of the feature vectors for the training data matches that of the feature vectors for a set of unlabeled points drawn from the unbiased distribution.
Let γ_i denote the perfect reweighting of the sample point x_i and γ̂_i the estimate derived by KMM. Let B denote the largest possible reweighting coefficient and let ϵ be a positive real number. We will assume that ϵ is chosen so that ϵ ≤ 1/2. Then, the following is the KMM constraint optimization problem:

\[ \min_{\gamma}\; G(\gamma) = \Big\| \frac{1}{m} \sum_{i=1}^{m} \gamma_i\, \Phi(x_i) - \frac{1}{n} \sum_{i=1}^{n} \Phi(x'_i) \Big\| \quad \text{subject to} \quad \gamma_i \in [0, B] \;\wedge\; \Big| \frac{1}{m} \sum_{i=1}^{m} \gamma_i - 1 \Big| \le \epsilon, \qquad (29) \]

where x'_1, ..., x'_n denote the unlabeled points drawn from the unbiased distribution. Let γ' be the solution of this optimization problem; then (1/m) Σ_{i=1}^m γ'_i = 1 + ϵ' with |ϵ'| ≤ ϵ. The normalized weights used in KMM's reweighting of the sample are thus defined by γ̂_i = γ'_i/(1 + ϵ').

As in the previous section, given x_1, ..., x_m ∈ X and a strictly positive definite universal kernel K, we denote by K ∈ R^{m×m} the kernel matrix defined by K_{ij} = K(x_i, x_j).
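The optimization in Equation 29 is a quadratic program in γ. The following simplified solver sketch (ours, not the implementation used in the experiments of this paper) minimizes the squared objective G²(γ) by projected gradient descent: the box constraint is enforced by clipping, and the sum constraint is handled through the normalization step described above. The kernel width, learning rate, and B value are hypothetical.

```python
# Sketch: kernel mean matching by projected gradient on G^2(gamma).
import numpy as np

def gaussian_kernel(A, B_, sigma=1.0):
    d2 = ((A[:, None, :] - B_[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kmm_weights(X_train, X_unbiased, B=1000.0, sigma=1.0,
                lr=1e-3, steps=2000):
    m, n = len(X_train), len(X_unbiased)
    K = gaussian_kernel(X_train, X_train, sigma)                     # m x m
    kappa = gaussian_kernel(X_train, X_unbiased, sigma).sum(1) / n   # mean map
    gamma = np.ones(m)
    for _ in range(steps):
        # gradient of ||(1/m) sum_i gamma_i Phi(x_i) - (1/n) sum_j Phi(x'_j)||^2
        grad = 2 * (K @ gamma / m - kappa) / m
        gamma = np.clip(gamma - lr * grad, 0.0, B)                   # box [0, B]
    return gamma / gamma.mean()  # normalization: gamma_hat = gamma'/(1+eps')

rng = np.random.default_rng(2)
X_unb = rng.normal(size=(200, 2))       # sample from the true distribution D
X_tr = X_unb[X_unb[:, 0] > 0]           # crudely biased sub-sample
print(kmm_weights(X_tr, X_unb)[:5])
```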
Proposition 4. Let K be a strictly positive definite universal kernel, and let W and Ŵ denote the distributions defined by W(x_i) = γ_i/m and Ŵ(x_i) = γ̂_i/m. Then, for any δ > 0, with probability at least 1 − δ, the l_2 distance of the distributions Ŵ and W is bounded as follows:

\[ l_2(\widehat{W}, W) \le \frac{2 \epsilon B}{\sqrt{m}} + \frac{2 \kappa}{\lambda_{\min}^{1/2}(\mathbf{K})} \sqrt{\frac{B^2}{m} + \frac{1}{n}}\, \Big( 1 + \sqrt{2 \log \tfrac{2}{\delta}} \Big). \qquad (30) \]
Proof. Since the optimal reweighting γ verifies the constraints of the optimization, by definition of γ' as a minimizer, G(γ') ≤ G(γ). Thus, by the triangle inequality,

\[ \frac{1}{m} \Big\| \sum_{i=1}^{m} (\gamma'_i - \gamma_i)\, \Phi(x_i) \Big\| \le G(\gamma') + G(\gamma) \le 2 G(\gamma). \qquad (31) \]

Let L denote the left-hand side of this inequality. By definition of the norm in the Hilbert space, L² = (1/m²)(γ' − γ)ᵀ K (γ' − γ). Then, by the standard bounds for the Rayleigh quotient of PDS matrices,

\[ \frac{1}{m^2}\, \lambda_{\min}(\mathbf{K})\, \|\gamma' - \gamma\|_2^2 \le L^2 \le 4 G^2(\gamma). \]

This combined with Inequality 31 yields

\[ \frac{1}{m}\, \|\gamma' - \gamma\|_2 \le \frac{2 G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}. \qquad (32) \]

Thus, by the triangle inequality and the definition of γ̂,

\[ \frac{1}{m}\, \|\hat{\gamma} - \gamma\|_2 \le \frac{1}{m}\, \|\hat{\gamma} - \gamma'\|_2 + \frac{1}{m}\, \|\gamma' - \gamma\|_2 = \frac{|\epsilon'|}{1 + \epsilon'}\, \frac{\|\gamma'\|_2}{m} + \frac{1}{m}\, \|\gamma' - \gamma\|_2 \le \frac{2 \epsilon B}{\sqrt{m}} + \frac{2 G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}, \qquad (33) \]

where we used ‖γ'‖_2 ≤ B√m, |ϵ'| ≤ ϵ, and 1 + ϵ' ≥ 1/2. It is not difficult to show using McDiarmid's inequality that, with probability at least 1 − δ, G(γ) ≤ κ (B²/m + 1/n)^{1/2} (1 + (2 log(2/δ))^{1/2}) (Lemma 4, (Huang et al., 2006a)). This combined with Inequality 33 yields the statement of the proposition. ∎
Let h_W be the hypothesis returned by the regularization algorithm based on N(·) = ‖·‖²_K when trained on S_W, the sample reweighted with the true reweighting γ, and h_Ŵ the one returned after training the same algorithm on S_Ŵ, the sample reweighted by KMM. The following theorem holds.

Theorem 4. Let K be a strictly positive definite universal kernel. Then, for any δ > 0, with probability at least 1 − δ, the difference in generalization error of the hypotheses h_W and h_Ŵ is bounded as follows:

\[ |R(h_W) - R(h_{\widehat{W}})| \le \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})}{2\lambda} \Bigg[ \frac{2 \epsilon B}{\sqrt{m}} + \frac{2 \kappa}{\lambda_{\min}^{1/2}(\mathbf{K})} \sqrt{\frac{B^2}{m} + \frac{1}{n}}\, \Big( 1 + \sqrt{2 \log \tfrac{2}{\delta}} \Big) \Bigg]. \qquad (34) \]

For ϵ = 0, the bound becomes

\[ |R(h_W) - R(h_{\widehat{W}})| \le \frac{\sigma^2 \kappa^2\, \lambda_{\max}^{1/2}(\mathbf{K})\, B}{\lambda\, \lambda_{\min}^{1/2}(\mathbf{K})} \sqrt{\frac{1}{m} + \frac{1}{B^2 n}}\, \Big( 1 + \sqrt{2 \log \tfrac{2}{\delta}} \Big). \qquad (35) \]

Proof. The result follows from Proposition 2, the distributional stability for the l_2 distance of kernel-based regularization algorithms (Theorem 1), and the bound of Proposition 4 on the l_2 distance between the true distribution W and the KMM estimate Ŵ. ∎

Comparing the bound of Theorem 4 for ϵ = 0 with the l_2 bound of Theorem 2, we first note that the largest reweighting coefficients B appearing in the two bounds are essentially related, modulo the constant Pr[s = 1], which is not included in the cluster-based reweighting. The cluster-based convergence is of the order O(λ_max^{1/2}(K) B² (log 2m'/(p_0 n m))^{1/2}), while the KMM convergence is of the order O(cond^{1/2}(K) B/√m), where cond(K) = λ_max(K)/λ_min(K). Taking the ratio of the former over the latter and noticing that p_0 ≤ O(1/B), we obtain the expression O(λ_min^{1/2}(K) (B³ log 2m'/n)^{1/2}). Thus, when B³ log 2m' is small compared to n, the convergence of the cluster-based bound is more favorable, while for other values the KMM bound converges faster.
5 Experimental Results

In this section, we compare the performance of the cluster-based reweighting technique and the KMM technique empirically. We first discuss and analyze the properties of the clustering method and our particular implementation.

The analysis of Section 4.1 deals with discrete points possibly resulting from the use of a quantization or clustering technique. However, due to the relatively small size of the public training sets available, clustering could leave us with few cluster representatives to train with. Instead, in our experiments, we only used the clusters to estimate sampling probabilities and applied these weights to the full set of training points. As the following proposition shows, the l_1 and l_2 distance bounds of Proposition 3 do not change significantly so long as the cluster size is roughly uniform and the sampling probability is the same for all points within a cluster. We will refer to this as the clustering assumption. In what follows, let k denote the number of clusters C_1, ..., C_k, let Pr[s = 1|C_i] designate the sampling probability for all x ∈ C_i, and define q(C_i) = Pr[s = 1|C_i] and q̂(C_i) = |C_i ∩ S|/|C_i ∩ U|.
Proposition 5. Let B = max_{i=1,...,k} max(1/q(C_i), 1/q̂(C_i)), let |C_M| denote the size of the largest cluster, and let q_0 = min_{i=1,...,k} Pr_D[C_i]. Then, for any δ > 0, with probability at least 1 − δ, the l_1 and l_2 distances of the distributions W and Ŵ can be bounded as follows:

\[ l_1(W, \widehat{W}) \le \frac{B^2\, |C_M|\, k}{m} \sqrt{\frac{\log 2k + \log \frac{1}{\delta}}{q_0\, n}} \quad \text{and} \quad l_2(W, \widehat{W}) \le \frac{B^2}{m} \sqrt{\frac{|C_M|\, k\, \big( \log 2k + \log \frac{1}{\delta} \big)}{q_0\, n}}. \]

Proof. By definition of the l_2 distance,

\[ l_2^2(W, \widehat{W}) = \sum_{i=1}^{k} \sum_{x \in C_i} \frac{1}{m^2} \Big( \frac{1}{q(C_i)} - \frac{1}{\hat{q}(C_i)} \Big)^2 \le \frac{B^4}{m^2} \sum_{i=1}^{k} |C_i|\, \big( q(C_i) - \hat{q}(C_i) \big)^2 \le \frac{B^4\, |C_M|\, k}{m^2} \max_i \big( q(C_i) - \hat{q}(C_i) \big)^2, \]

and a similar derivation gives the l_1 bound. The statement then follows from a uniform convergence bound on max_i |q(C_i) − q̂(C_i)|, obtained exactly as in Lemma 3. Under the clustering assumption, q̂(C_i) plays the role of the frequency estimate of Pr[s = 1|x], shared by all points of cluster C_i; the union bound is now taken over the k clusters rather than over the distinct points. ∎
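A minimal implementation of this cluster-based estimation (our sketch, assuming scikit-learn's KMeans; the data and the selection mechanism are hypothetical) clusters U, computes q̂(C_i) = |C_i ∩ S|/|C_i ∩ U| per cluster, and assigns each training point the weight 1/q̂(C_i) of its cluster.

```python
# Sketch: cluster-based estimation of sampling probabilities.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X_U = rng.normal(size=(1000, 4))            # unlabeled sample U drawn from D
in_S = rng.random(1000) < 1 / (1 + np.exp(-X_U[:, 0]))  # biased selection s
X_S = X_U[in_S]                             # biased training points S ⊆ U

k = 20
labels_U = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_U)
labels_S = labels_U[in_S]

# q_hat(C_i) = |C_i ∩ S| / |C_i ∩ U|; weights[i] is the weight of X_S[i]
cnt_U = np.bincount(labels_U, minlength=k)
cnt_S = np.bincount(labels_S, minlength=k)
q_hat = cnt_S / np.maximum(cnt_U, 1)
weights = 1.0 / np.maximum(q_hat[labels_S], 1e-12)
```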
Table 1. Normalized mean squared error (NMSE) for various regression data sets using unweighted, ideal, clustered, and kernel-mean-matched training sample reweightings.

DATASET     |U|     |S|    n_test  UNWEIGHTED    IDEAL         CLUSTERED     KMM
ABALONE     2000    724    2177    .654 ± .019   .551 ± .032   .623 ± .034   .709 ± .122
BANK32NH    4500    2384   3693    .903 ± .022   .610 ± .044   .635 ± .046   .691 ± .055
BANK8FM     4499    1998   3693    .085 ± .003   .058 ± .001   .068 ± .002   .079 ± .013
CALHOUSING  16512   9511   4128    .395 ± .010   .360 ± .009   .375 ± .010   .595 ± .054
CPUACT      4000    2400   4192    .673 ± .014   .523 ± .080   .568 ± .018   .518 ± .237
CPUSMALL    4000    2368   4192    .682 ± .053   .477 ± .097   .408 ± .071   .531 ± .280
HOUSING     300     116    206     .509 ± .049   .390 ± .053   .482 ± .042   .469 ± .148
KIN8NM      5000    2510   3192    .594 ± .008   .523 ± .045   .574 ± .018   .704 ± .068
PUMA8NH     4499    2246   3693    .685 ± .013   .674 ± .019   .641 ± .012   .903 ± .059
The data sets, listed in Table 1, are publicly available.⁴ The KMM method uses the approach of Huang et al. (2006b) with a Gaussian kernel, with parameters selected with a held-out set (Sugiyama et al., 2008). The sampling bias of each training set was generated according to the probability

\[ \Pr[s = 1 \mid x] = \frac{e^v}{1 + e^v}, \quad \text{where } v = \frac{4\, w \cdot (x - \bar{x})}{\sigma_{w \cdot (x - \bar{x})}}, \]

x ∈ R^d, and w ∈ R^d is chosen at random from [−1, 1]^d. In our experiments, we chose ten random projections w and reported results with the w, for each data set, that maximizes the difference between the unweighted and ideal methods over repeated sampling trials. In this way, we selected bias samplings that are good candidates for bias correction estimation.
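The following sketch (ours) reproduces this sampling scheme for a generic data matrix X; the random projection w and the resulting selection mask are the only moving parts.

```python
# Sketch: generate a biased subsample with Pr[s=1|x] = e^v / (1 + e^v),
# where v = 4 w.(x - xbar) / std(w.(x - xbar)).
import numpy as np

def biased_sample_mask(X, rng):
    d = X.shape[1]
    w = rng.uniform(-1.0, 1.0, size=d)       # random projection direction
    proj = (X - X.mean(axis=0)) @ w
    v = 4.0 * proj / proj.std()
    p = 1.0 / (1.0 + np.exp(-v))             # equals e^v / (1 + e^v)
    return rng.random(len(X)) < p            # s = 1 with probability p

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))
mask = biased_sample_mask(X, rng)            # training points: X[mask]
```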
For our experiments, we used a version of SVR available from LibSVM⁵ that can take as input weighted samples, with parameter values C = 1 and ϵ = 0.1, combined with a Gaussian kernel with parameter σ = d/2. We report results using the normalized mean squared error (NMSE):

\[ \text{NMSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \frac{(y_i - \hat{y}_i)^2}{\sigma_y^2}, \]

and provide average results and standard deviations for ten-fold cross-validation.
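For reference, the NMSE metric can be computed as follows (our sketch):

```python
# Sketch: normalized mean squared error (NMSE).
import numpy as np

def nmse(y_true, y_pred):
    # mean squared error normalized by the variance of the true labels
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```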
Our results show that reweighting with more reliable counts, due to clustering, can be effective in the problem of sample bias correction. These results also confirm the dependence that our theoretical bounds exhibit on the quantity p_0 n. The results obtained using KMM seem to be consistent with those reported by the authors of this technique.⁶
6 Conclusion

We presented a general analysis of sample selection bias correction and gave bounds analyzing the effect of an estimation error on the accuracy of the hypotheses returned. The notion of distributional stability and the techniques presented are general and can be of independent interest for the analysis of learning algorithms in other settings. In particular, these techniques apply similarly to other importance weighting algorithms and can be used in other contexts, such as that of learning in the presence of uncertain labels. The analysis of the discriminative method of (Bickel et al., 2007) for the problem of covariate shift could perhaps also benefit from this study.

⁴ www.liaad.up.pt/~ltorgo/Regression/DataSets.html
⁵ www.csie.ntu.edu.tw/~cjlin/libsvmtools
⁶ We thank Arthur Gretton for discussion and help in clarifying the choice of the parameters and design of the KMM experiments reported in (Huang et al., 2006b), and for providing the code used by the authors for comparison studies.
References

Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. ICML 2007 (pp. 81-88).
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499-526.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.
Cortes, C., & Vapnik, V. N. (1995). Support-Vector Networks. Machine Learning, 20, 273-297.
Devroye, L., & Wagner, T. (1979). Distribution-free performance bounds for potential function rules. IEEE Trans. on Information Theory (pp. 601-604).
Dudík, M., Schapire, R. E., & Phillips, S. J. (2006). Correcting sample selection bias in maximum entropy density estimation. NIPS 2005.
Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973-978).
Fan, W., Davidson, I., Zadrozny, B., & Yu, P. S. (2005). An improved categorization of classifier's sensitivity on sample selection bias. ICDM 2005 (pp. 605-608). IEEE Computer Society.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47, 151-161.
Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schölkopf, B. (2006a). Correcting Sample Selection Bias by Unlabeled Data (Technical Report CS-2006-44). University of Waterloo.
Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2006b). Correcting sample selection bias by unlabeled data. NIPS 2006 (pp. 601-608).
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. COLT 1997 (pp. 152-162).
Little, R. J. A., & Rubin, D. B. (1986). Statistical analysis with missing data. New York, NY, USA: John Wiley & Sons, Inc.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge Regression Learning Algorithm in Dual Variables. ICML 1998 (pp. 515-521).
Steinwart, I. (2002). On the influence of the kernel on the consistency of support vector machines. JMLR, 2, 67-93.
Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2008.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. ICML 2004.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. ICDM 2003.
Appendix: Proof of Theorem 3

Proof. Assume that Ψ(p) = Ψ(q) for two probability distributions p and q in P. It is known that if E_{x∼p}[f(x)] = E_{x∼q}[f(x)] for all f ∈ C(X), then p = q. Let f ∈ C(X) and fix ϵ > 0. Since K is universal, there exists a function g induced by K such that ‖f − g‖_∞ ≤ ϵ. E_{x∼p}[f(x)] − E_{x∼q}[f(x)] can be rewritten as

\[ \operatorname*{E}_{x \sim p}[f(x) - g(x)] + \operatorname*{E}_{x \sim p}[g(x)] - \operatorname*{E}_{x \sim q}[g(x)] + \operatorname*{E}_{x \sim q}[g(x) - f(x)], \qquad (36) \]

thus

\[ \Big| \operatorname*{E}_{x \sim p}[f(x)] - \operatorname*{E}_{x \sim q}[f(x)] \Big| \le \Big| \operatorname*{E}_{x \sim p}[g(x)] - \operatorname*{E}_{x \sim q}[g(x)] \Big| + 2\epsilon. \qquad (37) \]

Since g is induced by K, there exists w ∈ F such that for all x ∈ X, g(x) = ⟨w, Φ(x)⟩. Since F is separable, it admits a countable orthonormal basis (e_n)_{n∈N}. For n ∈ N, let w_n = ⟨w, e_n⟩ and Φ_n(x) = ⟨Φ(x), e_n⟩, so that g(x) = Σ_{n=0}^∞ w_n Φ_n(x). For each N ∈ N, consider the partial sum g_N(x) = Σ_{n=0}^N w_n Φ_n(x). By the Cauchy-Schwarz inequality,

\[ |g_N(x)| \le \Big( \sum_{n=0}^{N} w_n^2 \Big)^{1/2} \Big( \sum_{n=0}^{N} \Phi_n(x)^2 \Big)^{1/2} \le \|w\|_2\, \|\Phi(x)\|_2. \qquad (38) \]

Since K is universal, it is continuous and thus Φ is also continuous (Steinwart, 2002). Thus, x ↦ ‖Φ(x)‖_2 is a continuous function over the compact X and admits an upper bound B_0, so |g_N(x)| ≤ ‖w‖_2 B_0 for all N. The integral ∫ ‖w‖_2 B_0 dp is clearly well defined and equals ‖w‖_2 B_0. Thus, by the Lebesgue dominated convergence theorem, the following holds:

\[ \operatorname*{E}_{x \sim p}[g(x)] = \int \sum_{n=0}^{\infty} w_n \Phi_n(x)\, dp(x) = \sum_{n=0}^{\infty} w_n \int \Phi_n(x)\, dp(x). \qquad (39) \]

By definition of E_{x∼p}[Φ(x)], the last term is the inner product of w and that expectation. Thus,

\[ \operatorname*{E}_{x \sim p}[g(x)] = \big\langle w, \operatorname*{E}_{x \sim p}[\Phi(x)] \big\rangle = \langle w, \Psi(p) \rangle. \qquad (40) \]

A similar equality holds with the distribution q; thus, since Ψ(p) = Ψ(q),

\[ \operatorname*{E}_{x \sim p}[g(x)] = \operatorname*{E}_{x \sim q}[g(x)]. \qquad (41) \]

Thus, Inequality 37 can be rewritten as |E_{x∼p}[f(x)] − E_{x∼q}[f(x)]| ≤ 2ϵ, for all ϵ > 0. This implies E_{x∼p}[f(x)] = E_{x∼q}[f(x)] for all f ∈ C(X), and thus the injectivity of Ψ. ∎