
Sample Selection Bias Correction Theory

Corinna Cortes¹, Mehryar Mohri²,¹, Michael Riley¹, and Afshin Rostamizadeh²

¹ Google Research, 76 Ninth Avenue, New York, NY 10011.
² Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.

Abstract. This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability, which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.

1 Introduction

In the standard formulation of machine learning problems, the learning algorithm receives training and test samples drawn according to the same distribution. However, this assumption often does not hold in practice. The training sample available is biased in some way, which may be due to a variety of practical reasons such as the cost of data labeling or acquisition. The problem occurs in many areas such as astronomy, econometrics, and species habitat modeling.

In a common instance of this problem, points are drawn according to the test distribution but not all of them are made available to the learner. This is called the sample selection bias problem. Remarkably, it is often possible to correct this bias by using large amounts of unlabeled data.

The problem of sample selection bias correction for linear regression has been extensively studied in econometrics and statistics (Heckman, 1979; Little & Rubin, 1986), with the pioneering work of Heckman (1979). Several recent machine learning publications (Elkan, 2001; Zadrozny, 2004; Zadrozny et al., 2003; Fan et al., 2005; Dudík et al., 2006) have also dealt with this problem. The main correction technique used in all of these publications consists of reweighting the cost of training point errors to more closely reflect that of the test distribution. This is in fact a technique commonly used in statistics and machine learning for a variety of problems of this type (Little & Rubin, 1986). With the exact weights, this reweighting could optimally correct the bias, but, in practice, the weights are based on an estimate of the sampling probability from finite data sets. Thus, it is important to determine to what extent the error in this estimation can affect the accuracy of the hypothesis returned by the learning algorithm. To our knowledge, this problem has not been analyzed in a general manner.
This paper gives a theoretical analysis of sample selection bias correction. Our analysis is based on the novel concept of distributional stability, which generalizes the point-based stability introduced and analyzed by previous authors (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002). We show that large families of learning algorithms, including all kernel-based regularization algorithms such as Support Vector Regression (SVR) (Vapnik, 1998) or kernel ridge regression (Saunders et al., 1998), are distributionally stable, and we give the expression of their stability coefficient for both the l1 and l2 distances.

We then analyze two commonly used sample bias correction techniques: a cluster-based estimation technique and kernel mean matching (KMM) (Huang et al., 2006b). For each of these techniques, we derive bounds on the difference of the error rate of the hypothesis returned by a distributionally stable algorithm when using that estimation technique versus using perfect reweighting. We briefly discuss and compare these bounds and also report the results of experiments with both estimation techniques for several publicly available machine learning data sets. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when used in combination with a distributionally stable algorithm.

The remaining sections of this paper are organized as follows. Section 2 describes in detail the sample selection bias correction technique. Section 3 introduces the concept of distributional stability and proves the distributional stability of kernel-based regularization algorithms. Section 4 analyzes the effect of estimation error using distributionally stable algorithms for both the cluster-based and the KMM estimation techniques. Section 5 reports the results of experiments with several data sets comparing these estimation techniques.

2 Sample Selection Bias Correction

2.1 Problem

Let X denote the input space and Y the label set, which may be {0, 1} in classification or any measurable subset of R in regression estimation problems, and let D denote the true distribution over X × Y according to which test points are drawn. In the sample selection bias problem, some pairs z = (x, y) drawn according to D are not made available to the learning algorithm. The learning algorithm receives a training sample S of m labeled points z1, . . . , zm drawn according to a biased distribution D' over X × Y. This sample bias can be represented by a random variable s taking values in {0, 1}: when s = 1 the point is sampled, otherwise it is not. Thus, by definition of the sample selection bias, the support of the biased distribution D' is included in that of the true distribution D.

As in standard learning scenarios, the objective of the learning algorithm is to select a hypothesis h out of a hypothesis set H with a small generalization error R(h) with respect to the true distribution D, R(h) = E_{z∼D}[c(h, z)], where c(h, z) is the cost of the error of h on point z ∈ X × Y.

While the sample S is collected in some biased manner, it is often possible to derive some information about the nature of the bias. This can be done by exploiting large amounts of unlabeled data drawn according to the true distribution D, which is often available in practice. Thus, in the following let U be a sample drawn according to D and S ⊆ U a labeled but biased subsample.
2.2 Weighted Samples

A weighted sample Sw is a training sample S of m labeled points, z1, . . . , zm, drawn i.i.d. from X × Y, that is augmented with a non-negative weight wi ≥ 0 for each point zi. This weight is used to emphasize or de-emphasize the cost of an error on zi as in so-called importance weighting or cost-sensitive learning (Elkan, 2001; Zadrozny et al., 2003). One could use the weights wi to derive an equivalent but larger unweighted sample S' where the multiplicity of zi would reflect its weight wi, but most learning algorithms, e.g., decision trees, logistic regression, AdaBoost, Support Vector Machines (SVMs), kernel ridge regression, can directly accept a weighted sample Sw. We will refer to algorithms that can directly take Sw as input as weight-sensitive algorithms.
The empirical error of a hypothesis h on a weighted sample Sw is defined as

R̂w(h) = (1/m) Σ_{i=1}^m wi c(h, zi).    (1)

Proposition 1. Let D' be a distribution whose support coincides with that of D and let Sw be a weighted sample with wi = Pr_D(zi)/Pr_{D'}(zi) for all points zi in S. Then,

E_{S∼D'}[R̂w(h)] = R(h) = E_{z∼D}[c(h, z)].    (2)

Proof. Since the sample points are drawn i.i.d.,

E_{S∼D'}[R̂w(h)] = (1/m) Σ_{i=1}^m E_{S∼D'}[wi c(h, zi)] = E_{z1∼D'}[w1 c(h, z1)].    (3)

By definition of w and the fact that the supports of D and D' coincide, the right-hand side can be rewritten as follows

E_{z1∼D'}[(Pr_D(z1)/Pr_{D'}(z1)) c(h, z1)] = E_{z1∼D}[c(h, z1)] = E_{z∼D}[c(h, z)].    (4)

This last term is the definition of the generalization error R(h).

2.3 Bias Correction

The probability of drawing z = (x, y) according to the true but unobserved distribution D can be straightforwardly related to the observed distribution D'. By definition of the random variable s, the observed biased distribution D' can be expressed by Pr_{D'}[z] = Pr_D[z | s = 1]. We will assume that all points z in the support of D can be sampled with a non-zero probability so the supports of D and D' coincide. Thus for all z ∈ X × Y, Pr[s = 1 | z] ≠ 0. Then, by the Bayes formula, for all z in the support of D,

Pr_D[z] = Pr_D[z | s = 1] Pr[s = 1] / Pr[s = 1 | z] = (Pr[s = 1] / Pr[s = 1 | z]) Pr_{D'}[z].    (5)

Thus, if we were given the probabilities Pr[s = 1] and Pr[s = 1 | z], we could derive the true probability Pr_D from the biased one Pr_{D'} exactly and correct the sample selection bias.
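As a quick illustration of Proposition 1, the short Python sketch below (not from the paper; the toy discrete distribution, costs, and sample sizes are made up for illustration) draws samples from a biased distribution D' and checks that the importance-weighted empirical error of Equation 1 concentrates around the true generalization error R(h).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting: 5 points z with a fixed cost c(h, z) for some hypothesis h.
costs = np.array([0.0, 0.2, 0.5, 0.9, 1.0])        # c(h, z) for each point
p_true = np.array([0.3, 0.3, 0.2, 0.1, 0.1])       # Pr_D(z), the true distribution
p_sel = np.array([0.9, 0.7, 0.5, 0.3, 0.1])        # Pr[s = 1 | z], selection probabilities
p_bias = p_true * p_sel / np.sum(p_true * p_sel)   # Pr_D'(z), the biased distribution

R_true = np.dot(p_true, costs)                     # R(h) = E_{z~D}[c(h, z)]
weights = p_true / p_bias                          # w(z) = Pr_D(z) / Pr_D'(z) as in Proposition 1

m, trials = 200, 2000
est = []
for _ in range(trials):
    idx = rng.choice(5, size=m, p=p_bias)          # sample S ~ D'^m
    est.append(np.mean(weights[idx] * costs[idx])) # weighted empirical error, Equation 1

print(f"R(h) = {R_true:.4f},  average weighted empirical error = {np.mean(est):.4f}")
```

The two printed values should agree up to Monte Carlo noise, as Proposition 1 predicts.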

It is important to note that this correction is only needed for the training sample S, since it is the only source of selection bias. With a weight-sensitive algorithm, it suffices to reweight each sample point zi with the weight wi = Pr[s = 1] / Pr[s = 1 | zi]. Thus, Pr[s = 1 | z] need not be estimated for all points z but only for those falling in the training sample S. By Proposition 1, the expected value of the empirical error after reweighting is the same as if we were given samples from the true distribution, and the usual generalization bounds hold for R̂w(h) and R(h).

When the sampling probability is independent of the labels, as is commonly assumed in many settings (Zadrozny, 2004; Zadrozny et al., 2003), Pr[s = 1 | z] = Pr[s = 1 | x], and Equation 5 can be rewritten as

Pr_D[z] = (Pr[s = 1] / Pr[s = 1 | x]) Pr_{D'}[z].    (6)

In that case, the probabilities Pr[s = 1] and Pr[s = 1 | x] needed to reconstitute Pr_D from Pr_{D'} do not depend on the labels and thus can be estimated using the unlabeled points in U. Moreover, as already mentioned, for weight-sensitive algorithms, it suffices to estimate Pr[s = 1 | xi] for the points xi of the training data; no generalization is needed.

A simple case is when the points are defined over a discrete set.³ Pr[s = 1 | x] can then be estimated from the frequency mx/nx, where mx denotes the number of times x appeared in S ⊆ U and nx the number of times x appeared in the full data set U. Pr[s = 1] can be estimated by the quantity |S|/|U|. However, since Pr[s = 1] is a constant independent of x, its estimation is not even necessary.

³ This can be as a result of a quantization or clustering technique as discussed later.

If the estimation of the sampling probability Pr[s = 1 | x] from the unlabeled data set U were exact, then the reweighting just discussed could correct the sample bias optimally. Several techniques have been commonly used to estimate the reweighting quantities. But these estimated weights are not guaranteed to be exact. The next section addresses how the error in that estimation affects the error rate of the hypothesis returned by the learning algorithm.
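The following Python sketch shows the frequency-based estimate just described for discrete (or discretized) inputs; it is illustrative only, the function name, the toy data, and the omission of the constant Pr[s = 1] are our own choices rather than part of the paper.

```python
from collections import Counter
import numpy as np

def frequency_weights(train_x, unlabeled_x):
    """Estimate Pr[s = 1 | x] by m_x / n_x over discrete inputs and return one
    importance weight per training point, proportional to 1 / Pr[s = 1 | x]
    (the constant factor Pr[s = 1] is dropped)."""
    m_x = Counter(train_x)      # counts in the biased training sample S
    n_x = Counter(unlabeled_x)  # counts in the full unlabeled sample U (S is a subsample of U)
    p_hat = {x: m_x[x] / n_x[x] for x in m_x}   # estimate of Pr[s = 1 | x]
    return np.array([1.0 / p_hat[x] for x in train_x])

# Toy usage with already-discretized inputs (e.g., histogram-bucket or cluster ids).
U = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]   # hypothetical unlabeled sample U
S = [0, 0, 1, 2]                     # biased training subsample S, drawn from U
print(frequency_weights(S, U))       # points from under-sampled buckets get larger weights
```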

3 Distributional Stability

Here, we will examine the effect on the error of the hypothesis returned by the learning algorithm in response to a change in the way the training points are weighted. Since the weights are non-negative, we can assume that they are normalized and define a distribution over the training sample. This study can be viewed as a generalization of stability analysis, where a single sample point is changed (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002), to the more general case of distributional stability, where the sample's weight distribution is changed.

Thus, in this section the sample weight W of SW defines a distribution over S. For a fixed learning algorithm L and a fixed sample S, we will denote by hW the hypothesis returned by L for the weighted sample SW. We will denote by d(W, W') a divergence measure for two distributions W and W'. There are many standard measures for the divergences or distances between two distributions, including the relative entropy, the Hellinger distance, and the lp distance.
Definition 1 (Distributional β-Stability). A learning algorithm L is said to be distributionally β-stable for the divergence measure d if for any two weighted samples SW and SW',

∀z ∈ X × Y, |c(hW, z) − c(hW', z)| ≤ β d(W, W').    (7)

Thus, an algorithm is distributionally stable when small changes to a weighted sample's distribution, as measured by a divergence d, result in a small change in the cost of an error at any point. The following proposition follows directly from the definition of distributional stability.

Proposition 2. Let L be a distributionally β-stable algorithm and let hW (resp. hW') denote the hypothesis returned by L when trained on the weighted sample SW (resp. SW'). Let WT denote the distribution according to which test points are drawn. Then, the following holds

|R(hW) − R(hW')| ≤ β d(W, W').    (8)

Proof. By the distributional stability of the algorithm,

E_{z∼WT}[|c(z, hW) − c(z, hW')|] ≤ β d(W, W'),    (9)

which implies the statement of the proposition.
3.1 Distributional Stability of Kernel-Based Regularization Algorithms

Here, we show that kernel-based regularization algorithms are distributionally stable. This family of algorithms includes, among others, Support Vector Regression (SVR) and kernel ridge regression. Other algorithms, such as those based on relative entropy regularization, can be shown to be distributionally stable in a similar way as for point-based stability. Our results also apply to classification algorithms such as Support Vector Machines (SVM) (Cortes & Vapnik, 1995) using a margin-based loss function l as in (Bousquet & Elisseeff, 2002).

We will assume that the cost function c is σ-admissible, that is, there exists σ ∈ R+ such that for any two hypotheses h, h' ∈ H and for all z = (x, y) ∈ X × Y,

|c(h, z) − c(h', z)| ≤ σ |h(x) − h'(x)|.    (10)

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some M ∈ R+: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will also assume that c is differentiable. This assumption is in fact not necessary and all of our results hold without it, but it makes the presentation simpler.

Let N: H → R+ be a function defined over the hypothesis set. Regularization-based algorithms minimize an objective of the form

FW(h) = R̂W(h) + λ N(h),    (11)

where λ ≥ 0 is a trade-off parameter. We denote by BF the Bregman divergence associated to a convex function F, BF(f‖g) = F(f) − F(g) − ⟨f − g, ∇F(g)⟩, and define Δh as Δh = h' − h.

Lemma 1. Let the hypothesis set H be a vector space. Assume that N is a proper closed convex function and that N is differentiable. Assume that FW admits a minimizer h ∈ H and FW' a minimizer h' ∈ H. Then, the following bound holds,

BN(h'‖h) + BN(h‖h') ≤ (σ/λ) l1(W, W') sup_{x∈S} |Δh(x)|.    (12)

Proof. Since BFW = BR̂W + λ BN and BFW' = BR̂W' + λ BN, and a Bregman divergence is non-negative,

λ (BN(h'‖h) + BN(h‖h')) ≤ BFW(h'‖h) + BFW'(h‖h').    (13)

By the definition of h and h' as the minimizers of FW and FW', the gradients of FW at h and of FW' at h' vanish, thus

BFW(h'‖h) + BFW'(h‖h') = FW(h') − FW(h) + FW'(h) − FW'(h') = R̂W(h') − R̂W(h) + R̂W'(h) − R̂W'(h').

Thus, by the σ-admissibility of the cost function c, using the notation Wi = W(xi) and W'i = W'(xi),

λ (BN(h'‖h) + BN(h‖h')) ≤ Σ_{i=1}^m [c(h', zi) Wi − c(h, zi) Wi + c(h, zi) W'i − c(h', zi) W'i]
    = Σ_{i=1}^m (c(h', zi) − c(h, zi)) (Wi − W'i)    (14)
    ≤ Σ_{i=1}^m σ |Δh(xi)| |Wi − W'i| ≤ σ l1(W, W') sup_{x∈S} |Δh(x)|,

which establishes the lemma.
Given x1, . . . , xm ∈ X and a positive definite symmetric (PDS) kernel K, we denote by K ∈ R^{m×m} the kernel matrix defined by Kij = K(xi, xj) and by λmax(K) ∈ R+ the largest eigenvalue of K.

Lemma 2. Let H be a reproducing kernel Hilbert space with kernel K and let the regularization function N be defined by N(·) = ‖·‖²_K. Then, the following bound holds,

BN(h'‖h) + BN(h‖h') ≤ (σ/λ) λmax^{1/2}(K) l2(W, W') ‖Δh‖_K.    (15)

Proof. As in the proof of Lemma 1,

λ (BN(h'‖h) + BN(h‖h')) ≤ Σ_{i=1}^m (c(h', zi) − c(h, zi)) (Wi − W'i).    (16)

By definition of a reproducing kernel Hilbert space H, for any hypothesis h ∈ H and x ∈ X, h(x) = ⟨h, K(x, ·)⟩, and thus also for any Δh = h' − h with h, h' ∈ H and x ∈ X, Δh(x) = ⟨Δh, K(x, ·)⟩. Let δWi denote Wi − W'i, δW the vector whose components are the δWi's, and let V denote λ (BN(h'‖h) + BN(h‖h')). Using σ-admissibility, V ≤ σ Σ_{i=1}^m |Δh(xi) δWi| = σ Σ_{i=1}^m |⟨Δh, δWi K(xi, ·)⟩|. Let si ∈ {−1, +1} denote the sign of ⟨Δh, δWi K(xi, ·)⟩. Then,

V ≤ σ ⟨Δh, Σ_{i=1}^m si δWi K(xi, ·)⟩ ≤ σ ‖Δh‖_K ‖Σ_{i=1}^m si δWi K(xi, ·)‖_K
  = σ ‖Δh‖_K (Σ_{i,j=1}^m si sj δWi δWj K(xi, xj))^{1/2}
  = σ ‖Δh‖_K ((s ∘ δW)ᵀ K (s ∘ δW))^{1/2} ≤ σ ‖Δh‖_K ‖δW‖2 λmax^{1/2}(K).    (17)

In this derivation, the second inequality follows from the Cauchy-Schwarz inequality and the last inequality from the standard property of the Rayleigh quotient for PDS matrices. Since ‖δW‖2 = l2(W, W'), this proves the lemma.

Theorem 1. Let K be a kernel such that K(x, x) ≤ κ² < ∞ for all x ∈ X. Then, the regularization algorithm based on N(·) = ‖·‖²_K is distributionally β-stable for the l1 distance with β ≤ σ²κ²/(2λ), and for the l2 distance with β ≤ σ²κ λmax^{1/2}(K)/(2λ).

Proof. For N(·) = ‖·‖²_K, we have BN(h'‖h) = ‖h' − h‖²_K, thus BN(h'‖h) + BN(h‖h') = 2‖Δh‖²_K and, by Lemma 1,

2‖Δh‖²_K ≤ (σ/λ) l1(W, W') sup_{x∈S} |Δh(x)| ≤ (σ/λ) l1(W, W') κ ‖Δh‖_K.    (18)

Thus ‖Δh‖_K ≤ σκ l1(W, W')/(2λ). By the σ-admissibility of c,

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ σ |Δh(x)| ≤ σκ ‖Δh‖_K.    (19)

Therefore,

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ (σ²κ²/(2λ)) l1(W, W'),    (20)

which shows the distributional stability of a kernel-based regularization algorithm for the l1 distance. Using Lemma 2, a similar derivation leads to

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ (σ²κ λmax^{1/2}(K)/(2λ)) l2(W, W'),    (21)

which shows the distributional stability of a kernel-based regularization algorithm for the l2 distance.

Note that the standard point-based stability setting corresponds to two samples differing by a single point. If WU denotes the uniform distribution over the m training points and WU' the uniform distribution over the same sample with one point removed, then

l1(WU, WU') = Σ_{i=1}^{m−1} (1/(m−1) − 1/m) + 1/m = 2/m.    (22)

Thus, in the case of kernel-based regularized algorithms and for the l1 distance, standard uniform stability is a special case of distributional stability. It can be shown similarly that l2(WU, WU') = 1/√(m(m−1)).
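To make the l1 bound of Theorem 1 concrete, the following Python sketch (our own illustration, not part of the paper) trains weighted kernel ridge regression under two nearby weight distributions W and W' and compares the observed change in the squared loss at test points with β l1(W, W'), where β = σ²κ²/(2λ). The data, kernel width, and weight perturbation are arbitrary, and σ is taken as 4M under the assumption that predictions and labels stay within [−M, M], so this is a rough numerical check rather than a formal verification.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def weighted_krr(K_train, y, w, lam):
    """Minimize sum_i w_i (h(x_i) - y_i)^2 + lam ||h||_K^2; returns dual coefficients."""
    m = len(y)
    return np.linalg.solve(np.diag(w) @ K_train + lam * np.eye(m), w * y)

m, lam, M = 30, 1.0, 1.0
X = rng.uniform(-1, 1, size=(m, 1))
y = np.clip(np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(m), -M, M)
K = gaussian_kernel(X, X)
kappa2 = 1.0                       # K(x, x) = 1 for the Gaussian kernel, so kappa^2 = 1

W1 = np.full(m, 1.0 / m)           # uniform weight distribution
W2 = np.clip(W1 + rng.uniform(-0.3 / m, 0.3 / m, size=m), 1e-6, None)
W2 /= W2.sum()                     # nearby weight distribution

alpha1 = weighted_krr(K, y, W1, lam)
alpha2 = weighted_krr(K, y, W2, lam)

Xt = rng.uniform(-1, 1, size=(200, 1))
Kt = gaussian_kernel(Xt, X)
yt = np.clip(np.sin(3 * Xt[:, 0]), -M, M)
cost1 = (Kt @ alpha1 - yt) ** 2    # squared loss of h_W at each test point
cost2 = (Kt @ alpha2 - yt) ** 2    # squared loss of h_W'

sigma = 4 * M                      # admissibility constant for the quadratic cost (assumption)
beta_l1 = sigma**2 * kappa2 / (2 * lam)
print("max |c(h_W, z) - c(h_W', z)| =", np.max(np.abs(cost1 - cost2)))
print("beta * l1(W, W')            =", beta_l1 * np.sum(np.abs(W1 - W2)))
```

The first printed quantity should be well below the second, consistent with the l1 stability bound.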

4 Effect of Estimation Error for Kernel-Based Regularization Algorithms

This section analyzes the effect of an error in the estimation of the weight of a training example on the generalization error of the hypothesis h returned by a weight-sensitive learning algorithm. We will examine two estimation techniques: a straightforward histogram-based or cluster-based method, and kernel mean matching (KMM) (Huang et al., 2006b).
4.1 Cluster-Based Estimation

A straightforward estimate of the probability of sampling is based on the observed empirical frequencies. The ratio of the number of times a point x appears in S and the number of times it appears in U is an empirical estimate of Pr[s = 1 | x]. Note that generalization to unseen points x is not needed since reweighting requires only assigning weights to the seen training points. However, in general, training instances are typically unique or very infrequent since features are real-valued numbers. Instead, features can be discretized based on a partitioning of the input space X. The partitioning may be based on simple histogram buckets or the result of a clustering technique. The analysis of this section assumes such a prior partitioning of X.

We shall analyze how fast the resulting empirical frequencies converge to the true sampling probability. For x ∈ U, let Ux denote the subsample of U containing exactly all the instances of x and let n = |U| and nx = |Ux|. Furthermore, let n' denote the number of unique points in the sample U. Similarly, we define Sx, m, mx, and m' for the set S. Additionally, denote by p0 = min_{x∈U} Pr_D[x] ≠ 0.
Lemma 3. Let δ > 0. Then, with probability at least 1 − δ, the following inequality holds for all x in S:

|Pr[s = 1 | x] − mx/nx| ≤ √((log 2m' + log(1/δ)) / (p0 n)).    (23)

Proof. For a fixed x ∈ U, by Hoeffding's inequality,

Pr_U[|Pr[s = 1 | x] − mx/nx| ≥ ε] = Σ_{i=1}^n Pr_U[|Pr[s = 1 | x] − mx/i| ≥ ε | nx = i] Pr[nx = i]
    ≤ Σ_{i=1}^n 2 e^{−2iε²} Pr[nx = i].

Since nx is a binomial random variable with parameters Pr_U[x] = px and n, this last term can be expressed more explicitly and bounded as follows:

2 Σ_{i=0}^n e^{−2iε²} (n choose i) px^i (1 − px)^{n−i} = 2 (px e^{−2ε²} + (1 − px))^n
    = 2 (1 − px (1 − e^{−2ε²}))^n ≤ 2 exp(−px n (1 − e^{−2ε²})).

Since for x ∈ [0, 1], 1 − e^{−x} ≥ x/2, this shows that for ε ∈ [0, 1],

Pr_U[|Pr[s = 1 | x] − mx/nx| ≥ ε] ≤ 2 e^{−px n ε²} ≤ 2 e^{−p0 n ε²}.    (24)

By the union bound and the definition of p0,

Pr_U[∃x ∈ S: |Pr[s = 1 | x] − mx/nx| ≥ ε] ≤ 2 m' e^{−p0 n ε²}.    (25)

Setting δ to match this upper bound yields the statement of the lemma.
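As a small sanity check on the rate in Lemma 3, the Python sketch below (purely illustrative; the discrete distribution, selection probabilities, and sample sizes are made up) simulates the frequency estimates mx/nx and compares the largest deviation over the training points with the bound √((log 2m' + log(1/δ))/(p0 n)).

```python
import numpy as np

rng = np.random.default_rng(2)

k, n, delta = 20, 5000, 0.05
p = rng.dirichlet(np.ones(k))            # Pr_D[x] over k discrete points
p0 = p.min()
p_sel = rng.uniform(0.2, 0.9, size=k)    # Pr[s = 1 | x]

U = rng.choice(k, size=n, p=p)           # unlabeled sample U ~ D^n
selected = rng.random(n) < p_sel[U]      # each point of U is kept in S with prob. Pr[s = 1 | x]
S = U[selected]

n_x = np.bincount(U, minlength=k).astype(float)
m_x = np.bincount(S, minlength=k).astype(float)
seen = m_x > 0                           # distinct points that actually fall in S
dev = np.abs(p_sel[seen] - m_x[seen] / n_x[seen])

m_prime = seen.sum()                     # number of distinct training points m'
bound = np.sqrt((np.log(2 * m_prime) + np.log(1 / delta)) / (p0 * n))
print(f"max deviation = {dev.max():.3f}, Lemma 3 bound = {bound:.3f}")
```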

The following proposition bounds the distance between the distribution W corresponding to a perfectly reweighted sample (SW) and the one corresponding to a sample reweighted according to the observed bias (SŴ). For a sampled point xi, these distributions are defined as follows:

W(xi) = (1/m) (1/p(xi))  and  Ŵ(xi) = (1/m) (1/p̂(xi)),

where, for a distinct point x equal to the sampled point xi, we define p(xi) = Pr[s = 1 | x] and p̂(xi) = mx/nx.

Proposition 3. Let B = max_{i=1,...,m} max(1/p(xi), 1/p̂(xi)). Then, the l1 and l2 distances of the distributions W and Ŵ can be bounded as follows,

l1(W, Ŵ) ≤ B² √((log 2m' + log(1/δ)) / (p0 n))  and  l2(W, Ŵ) ≤ B² √((log 2m' + log(1/δ)) / (p0 n m)).    (26)

Proof. By definition of the l2 distance,

l2²(W, Ŵ) = (1/m²) Σ_{i=1}^m (1/p̂(xi) − 1/p(xi))² = (1/m²) Σ_{i=1}^m ((p(xi) − p̂(xi)) / (p(xi) p̂(xi)))²
    ≤ (1/m) B⁴ max_i (p̂(xi) − p(xi))².

It can be shown similarly that l1(W, Ŵ) ≤ B² max_i |p̂(xi) − p(xi)|. The application of the uniform convergence bound of Lemma 3 directly yields the statement of the proposition.

The following theorem provides a bound on the difference between the generalization error of the hypothesis returned by a kernel-based regularization algorithm when trained on the perfectly unbiased distribution, and the one trained on the sample bias-corrected using frequency estimates.

Theorem 2. Let K be a PDS kernel such that K(x, x) ≤ κ² < ∞ for all x ∈ X. Let hW be the hypothesis returned by the regularization algorithm based on N(·) = ‖·‖²_K using SW, and hŴ the one returned after training the same algorithm on SŴ. Then, for any δ > 0, with probability at least 1 − δ, the difference in generalization error of these hypotheses is bounded as follows

|R(hW) − R(hŴ)| ≤ (σ²κ²B²/(2λ)) √((log 2m' + log(1/δ)) / (p0 n))    (27)
|R(hW) − R(hŴ)| ≤ (σ²κ λmax^{1/2}(K) B²/(2λ)) √((log 2m' + log(1/δ)) / (p0 n m)).

Proof. The result follows from Proposition 2, the distributional stability and the bounds on the stability coefficient β for kernel-based regularization algorithms (Theorem 1), and the bounds of Proposition 3 on the l1 and l2 distances between the correct distribution W and the estimate Ŵ.

Let n0 be the number of occurrences, in U, of the least frequent training example. For large enough n, p0 n ≈ n0, thus the theorem suggests that the difference of error rate between the hypothesis returned after an optimal reweighting versus the one based on frequency estimates goes to zero as √(log m' / n0). In practice, m' ≤ m, the number of distinct points in S, is small; a fortiori, log m' is very small. Thus, the convergence rate depends essentially on the rate at which n0 increases. Additionally, if λmax(K) ≤ m (such as with Gaussian kernels), the l2-based bound will provide convergence that is at least as fast.

4.2 Kernel Mean Matching

The following definitions introduced by Steinwart (2002) will be needed for the presentation and discussion of the kernel mean matching (KMM) technique. Let X be a compact metric space and let C(X) denote the space of all continuous functions over X equipped with the standard infinity norm ‖·‖∞. Let K: X × X → R be a PDS kernel. There exists a Hilbert space F and a map Φ: X → F such that for all x, y ∈ X, K(x, y) = ⟨Φ(x), Φ(y)⟩. Note that for a given kernel K, F and Φ are not unique and that, for these definitions, F does not need to be a reproducing kernel Hilbert space (RKHS).

Let P denote the set of all probability distributions over X and let µ: P → F be the function defined by

∀p ∈ P, µ(p) = E_{x∼p}[Φ(x)].    (28)

A function g: X → R is said to be induced by K if there exists w ∈ F such that for all x ∈ X, g(x) = ⟨w, Φ(x)⟩. K is said to be universal if it is continuous and if the set of functions induced by K is dense in C(X).

Theorem 3 (Huang et al. (2006a)). Let F be a separable Hilbert space and let K be a universal kernel with feature space F and feature map Φ: X → F. Then, µ is injective.

Proof. We give a full proof of the main theorem supporting this technique in the appendix. The proof given by Huang et al. (2006a) does not seem to be complete.

The KMM technique is applicable when the learning algorithm is based on a universal kernel. The theorem shows that for a universal kernel, the expected value of the feature vectors induced uniquely determines the probability distribution. KMM uses this property to reweight training points so that the average value of the feature vectors for the training data matches that of the feature vectors for a set of unlabeled points drawn from the unbiased distribution.

Let γi denote the perfect reweighting of the sample point xi and γ̂i the estimate derived by KMM. Let B' denote the largest possible reweighting coefficient and let ε' be a positive real number. We will assume that ε' is chosen so that ε' ≤ 1/2. Then, the following is the KMM constrained optimization

min_γ G(γ) = ‖(1/m) Σ_{i=1}^m γi Φ(xi) − (1/n) Σ_{i=1}^n Φ(x'i)‖    (29)
subject to γi ∈ [0, B'] ∧ |(1/m) Σ_{i=1}^m γi − 1| ≤ ε',

where x'1, . . . , x'n denote the points of the unlabeled sample U. Let γ̂ be the solution of this optimization problem; then (1/m) Σ_{i=1}^m γ̂i = 1 + ε with |ε| ≤ ε'. For i ∈ [1, m], let γ̂'i = γ̂i/(1 + ε). The normalized weights used in KMM's reweighting of the sample are thus defined by Ŵi = γ̂'i/m, with Σ_{i=1}^m γ̂'i/m = 1.

As in the previous section, given x1, . . . , xm ∈ X and a strictly positive definite universal kernel K, we denote by K ∈ R^{m×m} the kernel matrix defined by Kij = K(xi, xj), and by λmin(K) > 0 the smallest eigenvalue of K. We also denote by cond(K) the condition number of the matrix K: cond(K) = λmax(K)/λmin(K). When K is universal, it is continuous over the compact X × X and thus bounded, and there exists κ < ∞ such that K(x, x) ≤ κ² for all x ∈ X.
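Below is a minimal Python sketch of the KMM optimization (29); it is not the authors' code, and the solver choice, kernel width, and constraint handling are illustrative assumptions. Expanding the squared objective gives the quadratic form (1/m²) γᵀKγ − (2/(mn)) γᵀ K_SU 1 plus a constant, where K is the kernel matrix over the training points and K_SU the kernel matrix between training and unlabeled points; the sketch minimizes it with SciPy's SLSQP solver under the box and mean constraints, then returns the normalized weights Ŵi.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kmm_weights(X_train, X_unlab, B_max=1000.0, eps=0.01, gamma=0.5):
    """Minimal kernel mean matching: minimize the distance between the reweighted
    training feature mean and the unlabeled feature mean, as in (29)."""
    m, n = len(X_train), len(X_unlab)
    K = gaussian_kernel(X_train, X_train, gamma)                       # m x m
    kappa_vec = gaussian_kernel(X_train, X_unlab, gamma).sum(axis=1)   # K_SU @ 1

    def objective(g):
        return g @ K @ g / m**2 - 2.0 * g @ kappa_vec / (m * n)

    cons = [{"type": "ineq", "fun": lambda g: eps - (np.mean(g) - 1.0)},
            {"type": "ineq", "fun": lambda g: eps + (np.mean(g) - 1.0)}]
    res = minimize(objective, x0=np.ones(m), method="SLSQP",
                   bounds=[(0.0, B_max)] * m, constraints=cons)
    g_hat = res.x
    return g_hat / np.mean(g_hat) / m      # normalized weights: gamma_hat'_i / m

# Toy usage: training points over-sample the left half of the input space.
rng = np.random.default_rng(3)
X_unlab = rng.uniform(-1, 1, size=(200, 1))
keep = rng.random(200) < np.where(X_unlab[:, 0] < 0, 0.9, 0.2)
X_train = X_unlab[keep]
w = kmm_weights(X_train, X_unlab)
print(w.min(), w.max(), w.sum())           # weights sum to 1; under-sampled points get larger weights
```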

Proposition 4. Let K be a strictly positive definite universal kernel. Then, for any δ > 0, with probability at least 1 − δ, the l2 distance of the distributions γ̂'/m and γ/m is bounded as follows:

(1/m) ‖γ̂' − γ‖2 ≤ 2ε'B'/√m + (2κ/λmin^{1/2}(K)) (1 + √(2 log(2/δ))) √(B'²/m + 1/n).    (30)
Proof. Since the optimal reweighting γ verifies the constraints of the optimization, by definition of γ̂ as a minimizer, G(γ̂) ≤ G(γ). Thus, by the triangle inequality,

‖(1/m) Σ_{i=1}^m (γ̂i − γi) Φ(xi)‖ ≤ G(γ̂) + G(γ) ≤ 2 G(γ).    (31)

Let L denote the left-hand side of this inequality. By definition of the norm in the Hilbert space, L² = (1/m²) (γ̂ − γ)ᵀ K (γ̂ − γ). Then, by the standard bounds for the Rayleigh quotient of PDS matrices, L² ≥ (1/m²) λmin(K) ‖γ̂ − γ‖2². This combined with Inequality 31 yields

(1/m) ‖γ̂ − γ‖2 ≤ 2 G(γ) / λmin^{1/2}(K).    (32)

Thus, by the triangle inequality,

(1/m) ‖γ̂' − γ‖2 ≤ (1/m) ‖γ̂' − γ̂‖2 + (1/m) ‖γ̂ − γ‖2 ≤ (1/m) ‖γ̂' − γ̂‖2 + 2 G(γ) / λmin^{1/2}(K).

Since γ̂' = γ̂/(1 + ε) with |ε| ≤ ε' ≤ 1/2, we have ‖γ̂' − γ̂‖2 = (|ε|/(1 + ε)) ‖γ̂‖2 ≤ 2|ε'| B' √m, and thus (1/m) ‖γ̂' − γ̂‖2 ≤ 2ε'B'/√m.

It is not difficult to show, using McDiarmid's inequality, that for any δ > 0, with probability at least 1 − δ, the following bound holds (Lemma 4, (Huang et al., 2006a)):

G(γ) ≤ κ (1 + √(2 log(2/δ))) √(B'²/m + 1/n).    (33)

This, combined with Inequality 33, yields the statement of the proposition.

Let h_{γ/m} be the hypothesis returned by the regularization algorithm based on N(·) = ‖·‖²_K when trained on the sample reweighted with the true distribution γ/m, and h_{γ̂'/m} the one returned when trained on the sample bias-corrected with KMM. The following theorem provides a bound on the difference between the generalization errors of these two hypotheses.

Theorem 4. Let K be a strictly positive definite symmetric universal kernel. Then, for any δ > 0, with probability at least 1 − δ, the difference in generalization error of these hypotheses is bounded as follows

|R(h_{γ/m}) − R(h_{γ̂'/m})| ≤ (σ²κ λmax^{1/2}(K)/(2λ)) [2ε'B'/√m + (2κ/λmin^{1/2}(K)) (1 + √(2 log(2/δ))) √(B'²/m + 1/n)].    (34)

Proof. The result follows from Proposition 2, the l2 distributional stability of kernel-based regularization algorithms (Theorem 1), and the bound of Proposition 4 on the l2 distance between the distributions γ̂'/m and γ/m.

Comparing this bound for ε' = 0 with the l2 bound of Theorem 2, we first note that B and B' are essentially related modulo the constant Pr[s = 1], which is not included in the cluster-based reweighting. Thus, the cluster-based convergence is of the order O(λmax^{1/2}(K) B² √(log m' / (p0 n m))) and the KMM convergence of the order O(cond^{1/2}(K) B'/√m). Taking the ratio of the former over the latter and noticing p0 ≥ O(1/B), we obtain the expression O(√(λmin(K) B log m' / n)). Thus, for n > λmin(K) B log m', the convergence of the cluster-based bound is more favorable, while for other values the KMM bound converges faster.

5 Experimental Results

In this section, we will compare the performance of the cluster-based reweighting technique and the KMM technique empirically. We will first discuss and analyze the properties of the clustering method and our particular implementation.

The analysis of Section 4.1 deals with discrete points possibly resulting from the use of a quantization or clustering technique. However, due to the relatively small size of the public training sets available, clustering could leave us with few cluster representatives to train with. Instead, in our experiments, we only used the clusters to estimate sampling probabilities and applied these weights to the full set of training points. As the following proposition shows, the l1 and l2 distance bounds of Proposition 3 do not change significantly so long as the cluster size is roughly uniform and the sampling probability is the same for all points within a cluster. We will refer to this as the clustering assumption. In what follows, let Pr[s = 1 | Ci] designate the sampling probability for all x ∈ Ci. Finally, define q(Ci) = Pr[s = 1 | Ci] and q̂(Ci) = |Ci ∩ S| / |Ci ∩ U|.

Proposition 5. Let B = max_{i=1,...,k} max(1/q(Ci), 1/q̂(Ci)). Then, the l1 and l2 distances of the distributions W and Ŵ can be bounded as follows,

l1(W, Ŵ) ≤ B² √(|CM| k (log 2k + log(1/δ)) / (q0 n m))  and  l2(W, Ŵ) ≤ B² √(|CM| k (log 2k + log(1/δ)) / (q0 n m²)),

where q0 = min_i q(Ci) and |CM| = max_i |Ci|.

Proof. By definition of the l2 distance,

l2²(W, Ŵ) = (1/m²) Σ_{i=1}^k Σ_{x∈Ci} (1/q̂(Ci) − 1/q(Ci))² ≤ (B⁴ |CM| / m²) Σ_{i=1}^k (q(Ci) − q̂(Ci))²
    ≤ (B⁴ |CM| k / m²) max_i (q(Ci) − q̂(Ci))² ≤ B⁴ |CM| k (log 2k + log(1/δ)) / (q0 n m²).

The right-hand side of the first line follows from the clustering assumption, and the following inequalities then follow from exactly the same steps as in Proposition 3, factoring away the sum over the elements of Ci. Finally, the max_i (q(Ci) − q̂(Ci)) term can be bounded just as in Lemma 3 using a uniform convergence bound, however now the union bound is taken over the clusters rather than the unique points. Note that when the cluster size is uniform, then |CM| k = m, and the bound above leads to an expression similar to that of Proposition 3.

We used the leaves of a decision tree to define the clusters. A decision tree selects binary cuts on the coordinates of x ∈ X that greedily minimize a node impurity measure, e.g., MSE for regression (Breiman et al., 1984). Points with similar features and labels are clustered together in this way, with the assumption that these will also have similar sampling probabilities.

Several methods for bias correction are compared in Table 1. Each method assigns corrective weights to the training samples. The unweighted method uses weight 1 for every training instance. The ideal method uses weight 1/Pr[s = 1 | x], which is optimal but requires the sampling distribution to be known.

Table 1. Normalized mean squared error (NMSE) for various regression data sets using unweighted, ideal, clustered and kernel-mean-matched training sample reweightings.

DATASET      |U|     |S|    ntest  UNWEIGHTED    IDEAL         CLUSTERED     KMM
ABALONE      2000    724    2177   .654 ± .019   .551 ± .032   .623 ± .034   .709 ± .122
BANK32NH     4500    2384   3693   .903 ± .022   .610 ± .044   .635 ± .046   .691 ± .055
BANK8FM      4499    1998   3693   .085 ± .003   .058 ± .001   .068 ± .002   .079 ± .013
CAL-HOUSING  16512   9511   4128   .395 ± .010   .360 ± .009   .375 ± .010   .595 ± .054
CPU-ACT      4000    2400   4192   .673 ± .014   .523 ± .080   .568 ± .018   .518 ± .237
CPU-SMALL    4000    2368   4192   .682 ± .053   .477 ± .097   .408 ± .071   .531 ± .280
HOUSING      300     116    206    .509 ± .049   .390 ± .053   .482 ± .042   .469 ± .148
KIN8NM       5000    2510   3192   .594 ± .008   .523 ± .045   .574 ± .018   .704 ± .068
PUMA8NH      4499    2246   3693   .685 ± .013   .674 ± .019   .641 ± .012   .903 ± .059

The clustered method uses weight |Ci ∩ U| / |Ci ∩ S|, where the clusters Ci are regression tree leaves with a minimum count of 4 (larger cluster sizes showed similar, though declining, performance). The KMM method uses the approach of Huang et al. (2006b) with a Gaussian kernel and parameters σ = √d/2 for x ∈ R^d, B = 1000, ε = 0. Note that we know of no principled way to do cross-validation with KMM since it cannot produce weights for a held-out set (Sugiyama et al., 2008).
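A minimal sketch of this cluster-based weighting using scikit-learn follows; it is our own illustrative code rather than the authors' implementation, and the function name and the guard against empty leaf counts are assumptions. It mirrors the description above: the leaves of a regression tree grown on the biased training sample define the clusters, and each training point receives the weight |Ci ∩ U| / |Ci ∩ S| of its leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cluster_weights(X_train, y_train, X_unlab, min_leaf=4):
    """Weight each training point by |C_i ∩ U| / |C_i ∩ S|, where the clusters C_i
    are the leaves of a regression tree grown on the biased training sample."""
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X_train, y_train)
    leaf_train = tree.apply(X_train)      # leaf id of each training point (its cluster)
    leaf_unlab = tree.apply(X_unlab)      # leaf id of each unlabeled point
    weights = np.empty(len(X_train))
    for leaf in np.unique(leaf_train):
        in_S = leaf_train == leaf
        in_U = np.sum(leaf_unlab == leaf)
        weights[in_S] = max(in_U, 1) / np.sum(in_S)   # |C_i ∩ U| / |C_i ∩ S|
    return weights
```

In the setting of the paper, S is a subsample of U, so every leaf containing training points also contains unlabeled points and the weights are at least 1.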

The regression data sets are from LIAAD⁴ and are sampled with P[s = 1 | x] = e^v / (1 + e^v), where v = 4 w·(x − x̄) / σ_{w·(x−x̄)}, x ∈ R^d, w ∈ R^d is chosen at random from [−1, 1]^d, and σ_{w·(x−x̄)} denotes the standard deviation of w·(x − x̄). In our experiments, we chose ten random projections w and reported results with the w, for each data set, that maximizes the difference between the unweighted and ideal methods over repeated sampling trials. In this way, we selected bias samplings that are good candidates for bias correction estimation.
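Biased training sets can be generated along the lines of the sampling scheme just described; the following Python sketch is an illustrative reconstruction, and the random seed and the exact centering and scaling conventions are assumptions on our part.

```python
import numpy as np

def biased_subsample(X, y, seed=0):
    """Draw a biased training subsample with P[s = 1 | x] = e^v / (1 + e^v),
    where v = 4 w.(x - x_bar) / std(w.(x - x_bar)) and w is uniform in [-1, 1]^d."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = rng.uniform(-1.0, 1.0, size=d)
    proj = (X - X.mean(axis=0)) @ w
    v = 4.0 * proj / proj.std()
    p_select = 1.0 / (1.0 + np.exp(-v))    # numerically stable form of e^v / (1 + e^v)
    keep = rng.random(len(X)) < p_select
    return X[keep], y[keep], p_select
```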
For our experiments, we used a version of SVR available from LibSVM⁵ that can take as input weighted samples, with parameter values C = 1 and ε = 0.1, combined with a Gaussian kernel with parameter σ = √d/2. We report results using the normalized mean squared error (NMSE):

NMSE = (1/ntest) Σ_{i=1}^{ntest} (yi − ŷi)² / σ_y²,

and provide mean and standard deviations for ten-fold cross-validation.
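As a rough stand-in for the weighted LibSVM SVR used here, the sketch below uses scikit-learn's SVR, which also accepts per-sample weights; the parameter values and the NMSE helper follow the description above, but this is illustrative code rather than the authors' actual setup.

```python
import numpy as np
from sklearn.svm import SVR

def nmse(y_true, y_pred):
    """Normalized mean squared error: mean squared error divided by the label variance."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def weighted_svr_nmse(X_train, y_train, weights, X_test, y_test):
    d = X_train.shape[1]
    sigma = np.sqrt(d) / 2.0
    model = SVR(C=1.0, epsilon=0.1, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
    model.fit(X_train, y_train, sample_weight=weights)   # weight-sensitive training
    return nmse(y_test, model.predict(X_test))
```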

Our results show that reweighting with more reliable counts, due to clustering, can be effective in the problem of sample bias correction. These results also confirm the dependence that our theoretical bounds exhibit on the quantity n0. The results obtained using KMM seem to be consistent with those reported by the authors of this technique.⁶

6 Conclusion

We presented a general analysis of sample selection bias correction and gave bounds analyzing the effect of an estimation error on the accuracy of the hypotheses returned. The notion of distributional stability and the techniques presented are general and can be of independent interest for the analysis of learning algorithms in other settings. In particular, these techniques apply similarly to other importance weighting algorithms and can be used in other contexts such as that of learning in the presence of uncertain labels. The analysis of the discriminative method of (Bickel et al., 2007) for the problem of covariate shift could perhaps also benefit from this study.

⁴ www.liaad.up.pt/~ltorgo/Regression/DataSets.html.
⁵ www.csie.ntu.edu.tw/~cjlin/libsvmtools.
⁶ We thank Arthur Gretton for discussion and help in clarifying the choice of the parameters and design of the KMM experiments reported in (Huang et al., 2006b), and for providing the code used by the authors for comparison studies.

References

Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. ICML 2007 (pp. 81-88).
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499-526.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.
Cortes, C., & Vapnik, V. N. (1995). Support-Vector Networks. Machine Learning, 20, 273-297.
Devroye, L., & Wagner, T. (1979). Distribution-free performance bounds for potential function rules. IEEE Trans. on Information Theory (pp. 601-604).
Dudík, M., Schapire, R. E., & Phillips, S. J. (2006). Correcting sample selection bias in maximum entropy density estimation. NIPS 2005.
Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973-978).
Fan, W., Davidson, I., Zadrozny, B., & Yu, P. S. (2005). An improved categorization of classifier's sensitivity on sample selection bias. ICDM 2005 (pp. 605-608). IEEE Computer Society.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47, 151-161.
Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schölkopf, B. (2006a). Correcting Sample Selection Bias by Unlabeled Data (Technical Report CS-2006-44). University of Waterloo.
Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2006b). Correcting sample selection bias by unlabeled data. NIPS 2006 (pp. 601-608).
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. COLT 1997 (pp. 152-162).
Little, R. J. A., & Rubin, D. B. (1986). Statistical analysis with missing data. New York, NY, USA: John Wiley & Sons, Inc.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge Regression Learning Algorithm in Dual Variables. ICML 1998 (pp. 515-521).
Steinwart, I. (2002). On the influence of the kernel on the consistency of support vector machines. JMLR, 2, 67-93.
Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2008.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. ICML 2004.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. ICDM 2003.

Proof of Theorem 3

Proof. Assume that µ(p) = µ(q) for two probability distributions p and q in P. It is known that if E_{x∼p}[f(x)] = E_{x∼q}[f(x)] for any f ∈ C(X), then p = q. Let f ∈ C(X) and fix ε > 0. Since K is universal, there exists a function g induced by K such that ‖f − g‖∞ ≤ ε. E_{x∼p}[f(x)] − E_{x∼q}[f(x)] can be rewritten as

E_{x∼p}[f(x)] − E_{x∼q}[f(x)] = E_{x∼p}[f(x) − g(x)] + E_{x∼p}[g(x)] − E_{x∼q}[g(x)] + E_{x∼q}[g(x) − f(x)].    (36)

Since |E_{x∼p}[f(x) − g(x)]| ≤ E_{x∼p}|f(x) − g(x)| ≤ ‖f − g‖∞ ≤ ε, and similarly |E_{x∼q}[g(x) − f(x)]| ≤ ε,

|E_{x∼p}[f(x)] − E_{x∼q}[f(x)]| ≤ |E_{x∼p}[g(x)] − E_{x∼q}[g(x)]| + 2ε.    (37)

Since g is induced by K, there exists w ∈ F such that for all x ∈ X, g(x) = ⟨w, Φ(x)⟩. Since F is separable, it admits a countable orthonormal basis (en)_{n∈N}. For n ∈ N, let wn = ⟨w, en⟩ and Φn(x) = ⟨Φ(x), en⟩. Then, g(x) = Σ_{n=0}^∞ wn Φn(x). For each N ∈ N, consider the partial sum gN(x) = Σ_{n=0}^N wn Φn(x). By the Cauchy-Schwarz inequality,

|gN(x)| ≤ (Σ_{n=0}^N wn²)^{1/2} (Σ_{n=0}^N Φn(x)²)^{1/2} ≤ ‖w‖2 ‖Φ(x)‖2.    (38)

Since K is universal, it is continuous and thus Φ is also continuous (Steinwart, 2002). Thus x ↦ ‖Φ(x)‖2 is a continuous function over the compact X and admits an upper bound B0. Thus, |gN(x)| ≤ ‖w‖2 B0. The integral ∫ ‖w‖2 B0 dp is clearly well defined and equals ‖w‖2 B0. Thus, by the Lebesgue dominated convergence theorem, the following holds:

E_{x∼p}[g(x)] = ∫ Σ_{n=0}^∞ wn Φn(x) dp(x) = Σ_{n=0}^∞ wn ∫ Φn(x) dp(x).    (39)

By definition of E_{x∼p}[Φ(x)], the last term is the inner product of w and that term. Thus,

E_{x∼p}[g(x)] = ⟨w, E_{x∼p}[Φ(x)]⟩ = ⟨w, µ(p)⟩.    (40)

A similar equality holds with the distribution q, thus,

E_{x∼p}[g(x)] − E_{x∼q}[g(x)] = ⟨w, µ(p) − µ(q)⟩ = 0.    (41)

Thus, Inequality 37 can be rewritten as |E_{x∼p}[f(x)] − E_{x∼q}[f(x)]| ≤ 2ε, for all ε > 0. This implies E_{x∼p}[f(x)] = E_{x∼q}[f(x)] for all f ∈ C(X) and the injectivity of µ.
