
Sample Selection Bias Correction Theory

Corinna Cortes¹, Mehryar Mohri²,¹, Michael Riley¹, and Afshin Rostamizadeh²

¹ Google Research, 76 Ninth Avenue, New York, NY 10011.
² Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.

Abstract. This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability, which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.

1 Introduction

In the standard formulation of machine learning problems, the learning algorithm receives training and test samples drawn according to the same distribution. However, this assumption often does not hold in practice. The training sample available is biased in some way, which may be due to a variety of practical reasons such as the cost of data labeling or acquisition. The problem occurs in many areas such as astronomy, econometrics, and species habitat modeling.

In a common instance of this problem, points are drawn according to the test distribution but not all of them are made available to the learner. This is called the sample selection bias problem. Remarkably, it is often possible to correct this bias by using large amounts of unlabeled data.

The problem of sample selection bias correction for linear regression has been extensively studied in econometrics and statistics (Heckman, 1979; Little & Rubin, 1986), with the pioneering work of Heckman (1979). Several recent machine learning publications (Elkan, 2001; Zadrozny, 2004; Zadrozny et al., 2003; Fan et al., 2005; Dudík et al., 2006) have also dealt with this problem. The main correction technique used in all of these publications consists of reweighting the cost of training point errors to more closely reflect that of the test distribution. This is in fact a technique commonly used in statistics and machine learning for a variety of problems of this type (Little & Rubin, 1986). With the exact weights, this reweighting could optimally correct the bias, but, in practice, the weights are based on an estimate of the sampling probability from finite data sets. Thus, it is important to determine to what extent the error in this estimation can affect the accuracy of the hypothesis returned by the learning algorithm. To our knowledge, this problem has not been analyzed in a general manner.
This paper gives a theoretical analysis of sample selection bias correction. Our analysis is based on the novel concept of distributional stability, which generalizes the point-based stability introduced and analyzed by previous authors (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002). We show that large families of learning algorithms, including all kernel-based regularization algorithms such as Support Vector Regression (SVR) (Vapnik, 1998) or kernel ridge regression (Saunders et al., 1998), are distributionally stable, and we give the expression of their stability coefficient for both the l1 and l2 distances.

We then analyze two commonly used sample bias correction techniques: a cluster-based estimation technique and kernel mean matching (KMM) (Huang et al., 2006b). For each of these techniques, we derive bounds on the difference of the error rate of the hypothesis returned by a distributionally stable algorithm when using that estimation technique versus using perfect reweighting. We briefly discuss and compare these bounds and also report the results of experiments with both estimation techniques for several publicly available machine learning data sets. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when used in combination with a distributionally stable algorithm.

The remaining sections of this paper are organized as follows. Section 2 describes in detail the sample selection bias correction technique. Section 3 introduces the concept of distributional stability and proves the distributional stability of kernel-based regularization algorithms. Section 4 analyzes the effect of estimation error using distributionally stable algorithms for both the cluster-based and the KMM estimation techniques. Section 5 reports the results of experiments with several data sets comparing these estimation techniques.

2 Sample Selection Bias Correction

2.1 Problem

Let X denote the input space and Y the label set, which may be {0, 1} in classification or any measurable subset of R in regression estimation problems, and let D denote the true distribution over X × Y according to which test points are drawn. In the sample selection bias problem, some pairs z = (x, y) drawn according to D are not made available to the learning algorithm. The learning algorithm receives a training sample S of m labeled points z1, . . . , zm drawn according to a biased distribution D' over X × Y. This sample bias can be represented by a random variable s taking values in {0, 1}: when s = 1 the point is sampled, otherwise it is not. Thus, by definition of the sample selection bias, the support of the biased distribution D' is included in that of the true distribution D.

As in standard learning scenarios, the objective of the learning algorithm is to select a hypothesis h out of a hypothesis set H with a small generalization error R(h) with respect to the true distribution D, R(h) = E_{z∼D}[c(h, z)], where c(h, z) is the cost of the error of h on point z ∈ X × Y.

While the sample S is collected in some biased manner, it is often possible to derive some information about the nature of the bias. This can be done by exploiting large amounts of unlabeled data drawn according to the true distribution D, which is often available in practice. Thus, in the following let U be a sample drawn according to D and S ⊆ U a labeled but biased subsample.
2.2 Weighted Samples

A weighted sample Sw is a training sample S of m labeled points, z1, . . . , zm, drawn i.i.d. from X × Y, that is augmented with a non-negative weight wi ≥ 0 for each point zi. This weight is used to emphasize or de-emphasize the cost of an error on zi as in so-called importance weighting or cost-sensitive learning (Elkan, 2001; Zadrozny et al., 2003). One could use the weights wi to derive an equivalent but larger unweighted sample S' where the multiplicity of zi would reflect its weight wi, but most learning algorithms, e.g., decision trees, logistic regression, AdaBoost, Support Vector Machines (SVMs), kernel ridge regression, can directly accept a weighted sample Sw. We will refer to algorithms that can directly take Sw as input as weight-sensitive algorithms.
The empirical error of a hypothesis h on a weighted sample Sw is defined as

R̂w(h) = (1/m) Σ_{i=1}^m wi c(h, zi).    (1)

Proposition 1. Let D' be a distribution whose support coincides with that of D and let Sw be a weighted sample with wi = Pr_D(zi)/Pr_{D'}(zi) for all points zi in S. Then,

E_{S∼D'}[R̂w(h)] = R(h) = E_{z∼D}[c(h, z)].    (2)

Proof. Since the sample points are drawn i.i.d.,

E_{S∼D'}[R̂w(h)] = (1/m) Σ_{i=1}^m E_{S∼D'}[wi c(h, zi)] = E_{z1∼D'}[w1 c(h, z1)].    (3)

By definition of w and the fact that the supports of D and D' coincide, the right-hand side can be rewritten as follows

E_{z1∼D'}[(Pr_D(z1)/Pr_{D'}(z1)) c(h, z1)] = E_{z1∼D}[c(h, z1)] = E_{z∼D}[c(h, z)].    (4)

This last term is the definition of the generalization error R(h).

2.3 Bias Correction

The probability of drawing z = (x, y) according to the true but unobserved distribution D can be straightforwardly related to the observed distribution D'. By definition of the random variable s, the observed biased distribution D' can be expressed by Pr_{D'}[z] = Pr_D[z | s = 1]. We will assume that all points z in the support of D can be sampled with a non-zero probability so the supports of D and D' coincide. Thus for all z ∈ X × Y, Pr[s = 1 | z] ≠ 0. Then, by the Bayes formula, for all z in the support of D,

Pr_D[z] = Pr_D[z | s = 1] Pr[s = 1] / Pr[s = 1 | z] = (Pr[s = 1] / Pr[s = 1 | z]) Pr_{D'}[z].    (5)

Thus, if we were given the probabilities Pr[s = 1] and Pr[s = 1 | z], we could derive the true probability Pr_D from the biased one Pr_{D'} exactly and correct the sample selection bias.
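As a quick illustration of Proposition 1, the short Python sketch below (not from the paper; the toy discrete distribution, costs, and sample sizes are made up for illustration) draws samples from a biased distribution D' and checks that the importance-weighted empirical error of Equation 1 concentrates around the true generalization error R(h).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting: 5 points z with a fixed cost c(h, z) for some hypothesis h.
costs = np.array([0.0, 0.2, 0.5, 0.9, 1.0])        # c(h, z) for each point
p_true = np.array([0.3, 0.3, 0.2, 0.1, 0.1])       # Pr_D(z), the true distribution
p_sel = np.array([0.9, 0.7, 0.5, 0.3, 0.1])        # Pr[s = 1 | z], selection probabilities
p_bias = p_true * p_sel / np.sum(p_true * p_sel)   # Pr_D'(z), the biased distribution

R_true = np.dot(p_true, costs)                     # R(h) = E_{z~D}[c(h, z)]
weights = p_true / p_bias                          # w(z) = Pr_D(z) / Pr_D'(z) as in Proposition 1

m, trials = 200, 2000
est = []
for _ in range(trials):
    idx = rng.choice(5, size=m, p=p_bias)          # sample S ~ D'^m
    est.append(np.mean(weights[idx] * costs[idx])) # weighted empirical error, Equation 1

print(f"R(h) = {R_true:.4f},  average weighted empirical error = {np.mean(est):.4f}")
```

The two printed values should agree up to Monte Carlo noise, as Proposition 1 predicts.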

It is important to note that this correction is only needed for the training sample S, since it is the only source of selection bias. With a weight-sensitive algorithm, it suffices to reweight each sample point zi with the weight wi = Pr[s = 1] / Pr[s = 1 | zi]. Thus, Pr[s = 1 | z] need not be estimated for all points z but only for those falling in the training sample S. By Proposition 1, the expected value of the empirical error after reweighting is the same as if we were given samples from the true distribution, and the usual generalization bounds hold for R̂w(h) and R(h).

When the sampling probability is independent of the labels, as is commonly assumed in many settings (Zadrozny, 2004; Zadrozny et al., 2003), Pr[s = 1 | z] = Pr[s = 1 | x], and Equation 5 can be rewritten as

Pr_D[z] = (Pr[s = 1] / Pr[s = 1 | x]) Pr_{D'}[z].    (6)

In that case, the probabilities Pr[s = 1] and Pr[s = 1 | x] needed to reconstitute Pr_D from Pr_{D'} do not depend on the labels and thus can be estimated using the unlabeled points in U. Moreover, as already mentioned, for weight-sensitive algorithms, it suffices to estimate Pr[s = 1 | xi] for the points xi of the training data; no generalization is needed.

A simple case is when the points are defined over a discrete set.³ Pr[s = 1 | x] can then be estimated from the frequency mx/nx, where mx denotes the number of times x appeared in S ⊆ U and nx the number of times x appeared in the full data set U. Pr[s = 1] can be estimated by the quantity |S|/|U|. However, since Pr[s = 1] is a constant independent of x, its estimation is not even necessary.

³ This can be as a result of a quantization or clustering technique as discussed later.

If the estimation of the sampling probability Pr[s = 1 | x] from the unlabeled data set U were exact, then the reweighting just discussed could correct the sample bias optimally. Several techniques have been commonly used to estimate the reweighting quantities. But these estimated weights are not guaranteed to be exact. The next section addresses how the error in that estimation affects the error rate of the hypothesis returned by the learning algorithm.
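The following Python sketch shows the frequency-based estimate just described for discrete (or discretized) inputs; it is illustrative only, the function name, the toy data, and the omission of the constant Pr[s = 1] are our own choices rather than part of the paper.

```python
from collections import Counter
import numpy as np

def frequency_weights(train_x, unlabeled_x):
    """Estimate Pr[s = 1 | x] by m_x / n_x over discrete inputs and return one
    importance weight per training point, proportional to 1 / Pr[s = 1 | x]
    (the constant factor Pr[s = 1] is dropped)."""
    m_x = Counter(train_x)      # counts in the biased training sample S
    n_x = Counter(unlabeled_x)  # counts in the full unlabeled sample U (S is a subsample of U)
    p_hat = {x: m_x[x] / n_x[x] for x in m_x}   # estimate of Pr[s = 1 | x]
    return np.array([1.0 / p_hat[x] for x in train_x])

# Toy usage with already-discretized inputs (e.g., histogram-bucket or cluster ids).
U = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]   # hypothetical unlabeled sample U
S = [0, 0, 1, 2]                     # biased training subsample S, drawn from U
print(frequency_weights(S, U))       # points from under-sampled buckets get larger weights
```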

3 Distributional Stability

Here, we will examine the effect on the error of the hypothesis returned by the learning algorithm in response to a change in the way the training points are weighted. Since the weights are non-negative, we can assume that they are normalized and define a distribution over the training sample. This study can be viewed as a generalization of stability analysis, where a single sample point is changed (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002), to the more general case of distributional stability, where the sample's weight distribution is changed.

Thus, in this section the sample weight W of SW defines a distribution over S. For a fixed learning algorithm L and a fixed sample S, we will denote by hW the hypothesis returned by L for the weighted sample SW. We will denote by d(W, W') a divergence measure for two distributions W and W'. There are many standard measures for the divergences or distances between two distributions, including the relative entropy, the Hellinger distance, and the lp distance.
Definition 1 (Distributional β-Stability). A learning algorithm L is said to be distributionally β-stable for the divergence measure d if for any two weighted samples SW and SW',

∀z ∈ X × Y, |c(hW, z) − c(hW', z)| ≤ β d(W, W').    (7)

Thus, an algorithm is distributionally stable when small changes to a weighted sample's distribution, as measured by a divergence d, result in a small change in the cost of an error at any point. The following proposition follows directly from the definition of distributional stability.

Proposition 2. Let L be a distributionally β-stable algorithm and let hW (resp. hW') denote the hypothesis returned by L when trained on the weighted sample SW (resp. SW'). Let WT denote the distribution according to which test points are drawn. Then, the following holds

|R(hW) − R(hW')| ≤ β d(W, W').    (8)

Proof. By the distributional stability of the algorithm,

E_{z∼WT}[|c(z, hW) − c(z, hW')|] ≤ β d(W, W'),    (9)

which implies the statement of the proposition.
3.1 Distributional Stability of Kernel-Based Regularization Algorithms

Here, we show that kernel-based regularization algorithms are distributionally stable. This family of algorithms includes, among others, Support Vector Regression (SVR) and kernel ridge regression. Other algorithms, such as those based on relative entropy regularization, can be shown to be distributionally stable in a similar way as for point-based stability. Our results also apply to classification algorithms such as Support Vector Machines (SVM) (Cortes & Vapnik, 1995) using a margin-based loss function l as in (Bousquet & Elisseeff, 2002).

We will assume that the cost function c is σ-admissible, that is, there exists σ ∈ R+ such that for any two hypotheses h, h' ∈ H and for all z = (x, y) ∈ X × Y,

|c(h, z) − c(h', z)| ≤ σ |h(x) − h'(x)|.    (10)

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some M ∈ R+: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will also assume that c is differentiable. This assumption is in fact not necessary and all of our results hold without it, but it makes the presentation simpler.

Let N: H → R+ be a function defined over the hypothesis set. Regularization-based algorithms minimize an objective of the form

FW(h) = R̂W(h) + λ N(h),    (11)

where λ ≥ 0 is a trade-off parameter. We denote by BF the Bregman divergence associated to a convex function F, BF(f‖g) = F(f) − F(g) − ⟨f − g, ∇F(g)⟩, and define Δh as Δh = h' − h.

Lemma 1. Let the hypothesis set H be a vector space. Assume that N is a proper closed convex function and that N is differentiable. Assume that FW admits a minimizer h ∈ H and FW' a minimizer h' ∈ H. Then, the following bound holds,

BN(h'‖h) + BN(h‖h') ≤ (σ/λ) l1(W, W') sup_{x∈S} |Δh(x)|.    (12)

Proof. Since BFW = BR̂W + λ BN and BFW' = BR̂W' + λ BN, and a Bregman divergence is non-negative,

λ (BN(h'‖h) + BN(h‖h')) ≤ BFW(h'‖h) + BFW'(h‖h').    (13)

By the definition of h and h' as the minimizers of FW and FW', the gradients of FW at h and of FW' at h' vanish, thus

BFW(h'‖h) + BFW'(h‖h') = FW(h') − FW(h) + FW'(h) − FW'(h') = R̂W(h') − R̂W(h) + R̂W'(h) − R̂W'(h').

Thus, by the σ-admissibility of the cost function c, using the notation Wi = W(xi) and W'i = W'(xi),

λ (BN(h'‖h) + BN(h‖h')) ≤ Σ_{i=1}^m [c(h', zi) Wi − c(h, zi) Wi + c(h, zi) W'i − c(h', zi) W'i]
    = Σ_{i=1}^m (c(h', zi) − c(h, zi)) (Wi − W'i)    (14)
    ≤ Σ_{i=1}^m σ |Δh(xi)| |Wi − W'i| ≤ σ l1(W, W') sup_{x∈S} |Δh(x)|,

which establishes the lemma.
Given x1, . . . , xm ∈ X and a positive definite symmetric (PDS) kernel K, we denote by K ∈ R^{m×m} the kernel matrix defined by Kij = K(xi, xj) and by λmax(K) ∈ R+ the largest eigenvalue of K.

Lemma 2. Let H be a reproducing kernel Hilbert space with kernel K and let the regularization function N be defined by N(·) = ‖·‖²_K. Then, the following bound holds,

BN(h'‖h) + BN(h‖h') ≤ (σ/λ) λmax^{1/2}(K) l2(W, W') ‖Δh‖_K.    (15)

Proof. As in the proof of Lemma 1,

λ (BN(h'‖h) + BN(h‖h')) ≤ Σ_{i=1}^m (c(h', zi) − c(h, zi)) (Wi − W'i).    (16)

By definition of a reproducing kernel Hilbert space H, for any hypothesis h ∈ H and x ∈ X, h(x) = ⟨h, K(x, ·)⟩, and thus also for any Δh = h' − h with h, h' ∈ H and x ∈ X, Δh(x) = ⟨Δh, K(x, ·)⟩. Let δWi denote Wi − W'i, δW the vector whose components are the δWi's, and let V denote λ (BN(h'‖h) + BN(h‖h')). Using σ-admissibility, V ≤ σ Σ_{i=1}^m |Δh(xi) δWi| = σ Σ_{i=1}^m |⟨Δh, δWi K(xi, ·)⟩|. Let si ∈ {−1, +1} denote the sign of ⟨Δh, δWi K(xi, ·)⟩. Then,

V ≤ σ ⟨Δh, Σ_{i=1}^m si δWi K(xi, ·)⟩ ≤ σ ‖Δh‖_K ‖Σ_{i=1}^m si δWi K(xi, ·)‖_K
  = σ ‖Δh‖_K (Σ_{i,j=1}^m si sj δWi δWj K(xi, xj))^{1/2}
  = σ ‖Δh‖_K ((s ∘ δW)ᵀ K (s ∘ δW))^{1/2} ≤ σ ‖Δh‖_K ‖δW‖2 λmax^{1/2}(K).    (17)

In this derivation, the second inequality follows from the Cauchy-Schwarz inequality and the last inequality from the standard property of the Rayleigh quotient for PDS matrices. Since ‖δW‖2 = l2(W, W'), this proves the lemma.

Theorem 1. Let K be a kernel such that K(x, x) ≤ κ² < ∞ for all x ∈ X. Then, the regularization algorithm based on N(·) = ‖·‖²_K is distributionally β-stable for the l1 distance with β ≤ σ²κ²/(2λ), and for the l2 distance with β ≤ σ²κ λmax^{1/2}(K)/(2λ).

Proof. For N(·) = ‖·‖²_K, we have BN(h'‖h) = ‖h' − h‖²_K, thus BN(h'‖h) + BN(h‖h') = 2‖Δh‖²_K and, by Lemma 1,

2‖Δh‖²_K ≤ (σ/λ) l1(W, W') sup_{x∈S} |Δh(x)| ≤ (σ/λ) l1(W, W') κ ‖Δh‖_K.    (18)

Thus ‖Δh‖_K ≤ σκ l1(W, W')/(2λ). By the σ-admissibility of c,

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ σ |Δh(x)| ≤ σκ ‖Δh‖_K.    (19)

Therefore,

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ (σ²κ²/(2λ)) l1(W, W'),    (20)

which shows the distributional stability of a kernel-based regularization algorithm for the l1 distance. Using Lemma 2, a similar derivation leads to

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ (σ²κ λmax^{1/2}(K)/(2λ)) l2(W, W'),    (21)

which shows the distributional stability of a kernel-based regularization algorithm for the l2 distance.

Note that the standard point-based stability setting corresponds to two samples differing by a single point. If WU denotes the uniform distribution over the m training points and WU' the uniform distribution over the same sample with one point removed, then

l1(WU, WU') = Σ_{i=1}^{m−1} (1/(m−1) − 1/m) + 1/m = 2/m.    (22)

Thus, in the case of kernel-based regularized algorithms and for the l1 distance, standard uniform stability is a special case of distributional stability. It can be shown similarly that l2(WU, WU') = 1/√(m(m−1)).
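To make the l1 bound of Theorem 1 concrete, the following Python sketch (our own illustration, not part of the paper) trains weighted kernel ridge regression under two nearby weight distributions W and W' and compares the observed change in the squared loss at test points with β l1(W, W'), where β = σ²κ²/(2λ). The data, kernel width, and weight perturbation are arbitrary, and σ is taken as 4M under the assumption that predictions and labels stay within [−M, M], so this is a rough numerical check rather than a formal verification.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def weighted_krr(K_train, y, w, lam):
    """Minimize sum_i w_i (h(x_i) - y_i)^2 + lam ||h||_K^2; returns dual coefficients."""
    m = len(y)
    return np.linalg.solve(np.diag(w) @ K_train + lam * np.eye(m), w * y)

m, lam, M = 30, 1.0, 1.0
X = rng.uniform(-1, 1, size=(m, 1))
y = np.clip(np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(m), -M, M)
K = gaussian_kernel(X, X)
kappa2 = 1.0                       # K(x, x) = 1 for the Gaussian kernel, so kappa^2 = 1

W1 = np.full(m, 1.0 / m)           # uniform weight distribution
W2 = np.clip(W1 + rng.uniform(-0.3 / m, 0.3 / m, size=m), 1e-6, None)
W2 /= W2.sum()                     # nearby weight distribution

alpha1 = weighted_krr(K, y, W1, lam)
alpha2 = weighted_krr(K, y, W2, lam)

Xt = rng.uniform(-1, 1, size=(200, 1))
Kt = gaussian_kernel(Xt, X)
yt = np.clip(np.sin(3 * Xt[:, 0]), -M, M)
cost1 = (Kt @ alpha1 - yt) ** 2    # squared loss of h_W at each test point
cost2 = (Kt @ alpha2 - yt) ** 2    # squared loss of h_W'

sigma = 4 * M                      # admissibility constant for the quadratic cost (assumption)
beta_l1 = sigma**2 * kappa2 / (2 * lam)
print("max |c(h_W, z) - c(h_W', z)| =", np.max(np.abs(cost1 - cost2)))
print("beta * l1(W, W')            =", beta_l1 * np.sum(np.abs(W1 - W2)))
```

The first printed quantity should be well below the second, consistent with the l1 stability bound.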

4 Effect of Estimation Error for Kernel-Based Regularization Algorithms

This section analyzes the effect of an error in the estimation of the weight of a training example on the generalization error of the hypothesis h returned by a weight-sensitive learning algorithm. We will examine two estimation techniques: a straightforward histogram-based or cluster-based method, and kernel mean matching (KMM) (Huang et al., 2006b).
4.1 Cluster-Based Estimation

A straightforward estimate of the probability of sampling is based on the observed empirical frequencies. The ratio of the number of times a point x appears in S and the number of times it appears in U is an empirical estimate of Pr[s = 1 | x]. Note that generalization to unseen points x is not needed since reweighting requires only assigning weights to the seen training points. However, in general, training instances are typically unique or very infrequent since features are real-valued numbers. Instead, features can be discretized based on a partitioning of the input space X. The partitioning may be based on simple histogram buckets or the result of a clustering technique. The analysis of this section assumes such a prior partitioning of X.

We shall analyze how fast the resulting empirical frequencies converge to the true sampling probability. For x ∈ U, let Ux denote the subsample of U containing exactly all the instances of x and let n = |U| and nx = |Ux|. Furthermore, let n' denote the number of unique points in the sample U. Similarly, we define Sx, m, mx, and m' for the set S. Additionally, denote by p0 = min_{x∈U} Pr_D[x] ≠ 0.
Lemma 3. Let δ > 0. Then, with probability at least 1 − δ, the following inequality holds for all x in S:

|Pr[s = 1 | x] − mx/nx| ≤ √((log 2m' + log(1/δ)) / (p0 n)).    (23)

Proof. For a fixed x ∈ U, by Hoeffding's inequality,

Pr_U[|Pr[s = 1 | x] − mx/nx| ≥ ε] = Σ_{i=1}^n Pr_U[|Pr[s = 1 | x] − mx/i| ≥ ε | nx = i] Pr[nx = i]
    ≤ Σ_{i=1}^n 2 e^{−2iε²} Pr[nx = i].

Since nx is a binomial random variable with parameters Pr_U[x] = px and n, this last term can be expressed more explicitly and bounded as follows:

2 Σ_{i=0}^n e^{−2iε²} (n choose i) px^i (1 − px)^{n−i} = 2 (px e^{−2ε²} + (1 − px))^n
    = 2 (1 − px (1 − e^{−2ε²}))^n ≤ 2 exp(−px n (1 − e^{−2ε²})).

Since for x ∈ [0, 1], 1 − e^{−x} ≥ x/2, this shows that for ε ∈ [0, 1],

Pr_U[|Pr[s = 1 | x] − mx/nx| ≥ ε] ≤ 2 e^{−px n ε²} ≤ 2 e^{−p0 n ε²}.    (24)

By the union bound and the definition of p0,

Pr_U[∃x ∈ S: |Pr[s = 1 | x] − mx/nx| ≥ ε] ≤ 2 m' e^{−p0 n ε²}.    (25)

Setting δ to match this upper bound yields the statement of the lemma.
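As a small sanity check on the rate in Lemma 3, the Python sketch below (purely illustrative; the discrete distribution, selection probabilities, and sample sizes are made up) simulates the frequency estimates mx/nx and compares the largest deviation over the training points with the bound √((log 2m' + log(1/δ))/(p0 n)).

```python
import numpy as np

rng = np.random.default_rng(2)

k, n, delta = 20, 5000, 0.05
p = rng.dirichlet(np.ones(k))            # Pr_D[x] over k discrete points
p0 = p.min()
p_sel = rng.uniform(0.2, 0.9, size=k)    # Pr[s = 1 | x]

U = rng.choice(k, size=n, p=p)           # unlabeled sample U ~ D^n
selected = rng.random(n) < p_sel[U]      # each point of U is kept in S with prob. Pr[s = 1 | x]
S = U[selected]

n_x = np.bincount(U, minlength=k).astype(float)
m_x = np.bincount(S, minlength=k).astype(float)
seen = m_x > 0                           # distinct points that actually fall in S
dev = np.abs(p_sel[seen] - m_x[seen] / n_x[seen])

m_prime = seen.sum()                     # number of distinct training points m'
bound = np.sqrt((np.log(2 * m_prime) + np.log(1 / delta)) / (p0 * n))
print(f"max deviation = {dev.max():.3f}, Lemma 3 bound = {bound:.3f}")
```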

The following proposition bounds the distance between the distribution W corresponding to a perfectly reweighted sample (SW) and the one corresponding to a sample reweighted according to the observed bias (SŴ). For a sampled point xi, these distributions are defined as follows:

W(xi) = (1/m) (1/p(xi))  and  Ŵ(xi) = (1/m) (1/p̂(xi)),

where, for a distinct point x equal to the sampled point xi, we define p(xi) = Pr[s = 1 | x] and p̂(xi) = mx/nx.

Proposition 3. Let B = max_{i=1,...,m} max(1/p(xi), 1/p̂(xi)). Then, the l1 and l2 distances of the distributions W and Ŵ can be bounded as follows,

l1(W, Ŵ) ≤ B² √((log 2m' + log(1/δ)) / (p0 n))  and  l2(W, Ŵ) ≤ B² √((log 2m' + log(1/δ)) / (p0 n m)).    (26)

Proof. By definition of the l2 distance,

l2²(W, Ŵ) = (1/m²) Σ_{i=1}^m (1/p̂(xi) − 1/p(xi))² = (1/m²) Σ_{i=1}^m ((p(xi) − p̂(xi)) / (p(xi) p̂(xi)))²
    ≤ (1/m) B⁴ max_i (p̂(xi) − p(xi))².

It can be shown similarly that l1(W, Ŵ) ≤ B² max_i |p̂(xi) − p(xi)|. The application of the uniform convergence bound of Lemma 3 directly yields the statement of the proposition.

The following theorem provides a bound on the difference between the generalization error of the hypothesis returned by a kernel-based regularization algorithm when trained on the perfectly unbiased distribution, and the one trained on the sample bias-corrected using frequency estimates.

Theorem 2. Let K be a PDS kernel such that K(x, x) ≤ κ² < ∞ for all x ∈ X. Let hW be the hypothesis returned by the regularization algorithm based on N(·) = ‖·‖²_K using SW, and hŴ the one returned after training the same algorithm on SŴ. Then, for any δ > 0, with probability at least 1 − δ, the difference in generalization error of these hypotheses is bounded as follows

|R(hW) − R(hŴ)| ≤ (σ²κ²B²/(2λ)) √((log 2m' + log(1/δ)) / (p0 n))    (27)
|R(hW) − R(hŴ)| ≤ (σ²κ λmax^{1/2}(K) B²/(2λ)) √((log 2m' + log(1/δ)) / (p0 n m)).

Proof. The result follows from Proposition 2, the distributional stability and the bounds on the stability coefficient β for kernel-based regularization algorithms (Theorem 1), and the bounds of Proposition 3 on the l1 and l2 distances between the correct distribution W and the estimate Ŵ.

Let n0 be the number of occurrences, in U, of the least frequent training example. For large enough n, p0 n ≈ n0, thus the theorem suggests that the difference of error rate between the hypothesis returned after an optimal reweighting versus the one based on frequency estimates goes to zero as √(log m' / n0). In practice, m' ≤ m, the number of distinct points in S, is small; a fortiori, log m' is very small. Thus, the convergence rate depends essentially on the rate at which n0 increases. Additionally, if λmax(K) ≤ m (such as with Gaussian kernels), the l2-based bound will provide convergence that is at least as fast.

4.2 Kernel Mean Matching

The following definitions introduced by Steinwart (2002) will be needed for the presentation and discussion of the kernel mean matching (KMM) technique. Let X be a compact metric space and let C(X) denote the space of all continuous functions over X equipped with the standard infinity norm ‖·‖∞. Let K: X × X → R be a PDS kernel. There exists a Hilbert space F and a map Φ: X → F such that for all x, y ∈ X, K(x, y) = ⟨Φ(x), Φ(y)⟩. Note that for a given kernel K, F and Φ are not unique and that, for these definitions, F does not need to be a reproducing kernel Hilbert space (RKHS).

Let P denote the set of all probability distributions over X and let µ: P → F be the function defined by

∀p ∈ P, µ(p) = E_{x∼p}[Φ(x)].    (28)

A function g: X → R is said to be induced by K if there exists w ∈ F such that for all x ∈ X, g(x) = ⟨w, Φ(x)⟩. K is said to be universal if it is continuous and if the set of functions induced by K is dense in C(X).

Theorem 3 (Huang et al. (2006a)). Let F be a separable Hilbert space and let K be a universal kernel with feature space F and feature map Φ: X → F. Then, µ is injective.

Proof. We give a full proof of the main theorem supporting this technique in the appendix. The proof given by Huang et al. (2006a) does not seem to be complete.

The KMM technique is applicable when the learning algorithm is based on a universal kernel. The theorem shows that for a universal kernel, the expected value of the feature vectors induced uniquely determines the probability distribution. KMM uses this property to reweight training points so that the average value of the feature vectors for the training data matches that of the feature vectors for a set of unlabeled points drawn from the unbiased distribution.

Let γi denote the perfect reweighting of the sample point xi and γ̂i the estimate derived by KMM. Let B' denote the largest possible reweighting coefficient and let ε' be a positive real number. We will assume that ε' is chosen so that ε' ≤ 1/2. Then, the following is the KMM constrained optimization

min_γ G(γ) = ‖(1/m) Σ_{i=1}^m γi Φ(xi) − (1/n) Σ_{i=1}^n Φ(x'i)‖    (29)
subject to γi ∈ [0, B'] ∧ |(1/m) Σ_{i=1}^m γi − 1| ≤ ε',

where x'1, . . . , x'n denote the points of the unlabeled sample U. Let γ̂ be the solution of this optimization problem; then (1/m) Σ_{i=1}^m γ̂i = 1 + ε with |ε| ≤ ε'. For i ∈ [1, m], let γ̂'i = γ̂i/(1 + ε). The normalized weights used in KMM's reweighting of the sample are thus defined by Ŵi = γ̂'i/m, with Σ_{i=1}^m γ̂'i/m = 1.

As in the previous section, given x1, . . . , xm ∈ X and a strictly positive definite universal kernel K, we denote by K ∈ R^{m×m} the kernel matrix defined by Kij = K(xi, xj), and by λmin(K) > 0 the smallest eigenvalue of K. We also denote by cond(K) the condition number of the matrix K: cond(K) = λmax(K)/λmin(K). When K is universal, it is continuous over the compact X × X and thus bounded, and there exists κ < ∞ such that K(x, x) ≤ κ² for all x ∈ X.
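Below is a minimal Python sketch of the KMM optimization (29); it is not the authors' code, and the solver choice, kernel width, and constraint handling are illustrative assumptions. Expanding the squared objective gives the quadratic form (1/m²) γᵀKγ − (2/(mn)) γᵀ K_SU 1 plus a constant, where K is the kernel matrix over the training points and K_SU the kernel matrix between training and unlabeled points; the sketch minimizes it with SciPy's SLSQP solver under the box and mean constraints, then returns the normalized weights Ŵi.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kmm_weights(X_train, X_unlab, B_max=1000.0, eps=0.01, gamma=0.5):
    """Minimal kernel mean matching: minimize the distance between the reweighted
    training feature mean and the unlabeled feature mean, as in (29)."""
    m, n = len(X_train), len(X_unlab)
    K = gaussian_kernel(X_train, X_train, gamma)                       # m x m
    kappa_vec = gaussian_kernel(X_train, X_unlab, gamma).sum(axis=1)   # K_SU @ 1

    def objective(g):
        return g @ K @ g / m**2 - 2.0 * g @ kappa_vec / (m * n)

    cons = [{"type": "ineq", "fun": lambda g: eps - (np.mean(g) - 1.0)},
            {"type": "ineq", "fun": lambda g: eps + (np.mean(g) - 1.0)}]
    res = minimize(objective, x0=np.ones(m), method="SLSQP",
                   bounds=[(0.0, B_max)] * m, constraints=cons)
    g_hat = res.x
    return g_hat / np.mean(g_hat) / m      # normalized weights: gamma_hat'_i / m

# Toy usage: training points over-sample the left half of the input space.
rng = np.random.default_rng(3)
X_unlab = rng.uniform(-1, 1, size=(200, 1))
keep = rng.random(200) < np.where(X_unlab[:, 0] < 0, 0.9, 0.2)
X_train = X_unlab[keep]
w = kmm_weights(X_train, X_unlab)
print(w.min(), w.max(), w.sum())           # weights sum to 1; under-sampled points get larger weights
```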

Proposition 4. Let K be a strictly positive definite universal kernel. Then, for any δ > 0, with probability at least 1 − δ, the l2 distance of the distributions γ̂'/m and γ/m is bounded as follows:

(1/m) ‖γ̂' − γ‖2 ≤ 2ε'B'/√m + (2κ/λmin^{1/2}(K)) (1 + √(2 log(2/δ))) √(B'²/m + 1/n).    (30)
Proof. Since the optimal reweighting γ verifies the constraints of the optimization, by definition of γ̂ as a minimizer, G(γ̂) ≤ G(γ). Thus, by the triangle inequality,

‖(1/m) Σ_{i=1}^m (γ̂i − γi) Φ(xi)‖ ≤ G(γ̂) + G(γ) ≤ 2 G(γ).    (31)

Let L denote the left-hand side of this inequality. By definition of the norm in the Hilbert space, L² = (1/m²) (γ̂ − γ)ᵀ K (γ̂ − γ). Then, by the standard bounds for the Rayleigh quotient of PDS matrices, L² ≥ (1/m²) λmin(K) ‖γ̂ − γ‖2². This combined with Inequality 31 yields

(1/m) ‖γ̂ − γ‖2 ≤ 2 G(γ) / λmin^{1/2}(K).    (32)

Thus, by the triangle inequality,

(1/m) ‖γ̂' − γ‖2 ≤ (1/m) ‖γ̂' − γ̂‖2 + (1/m) ‖γ̂ − γ‖2 ≤ (1/m) ‖γ̂' − γ̂‖2 + 2 G(γ) / λmin^{1/2}(K).

Since γ̂' = γ̂/(1 + ε) with |ε| ≤ ε' ≤ 1/2, we have ‖γ̂' − γ̂‖2 = (|ε|/(1 + ε)) ‖γ̂‖2 ≤ 2|ε'| B' √m, and thus (1/m) ‖γ̂' − γ̂‖2 ≤ 2ε'B'/√m.

It is not difficult to show, using McDiarmid's inequality, that for any δ > 0, with probability at least 1 − δ, the following bound holds (Lemma 4, (Huang et al., 2006a)):

G(γ) ≤ κ (1 + √(2 log(2/δ))) √(B'²/m + 1/n).    (33)

This, combined with Inequality 33, yields the statement of the proposition.

Let h_{γ/m} be the hypothesis returned by the regularization algorithm based on N(·) = ‖·‖²_K when trained on the sample reweighted with the true distribution γ/m, and h_{γ̂'/m} the one returned when trained on the sample bias-corrected with KMM. The following theorem provides a bound on the difference between the generalization errors of these two hypotheses.

Theorem 4. Let K be a strictly positive definite symmetric universal kernel. Then, for any δ > 0, with probability at least 1 − δ, the difference in generalization error of these hypotheses is bounded as follows

|R(h_{γ/m}) − R(h_{γ̂'/m})| ≤ (σ²κ λmax^{1/2}(K)/(2λ)) [2ε'B'/√m + (2κ/λmin^{1/2}(K)) (1 + √(2 log(2/δ))) √(B'²/m + 1/n)].    (34)

Proof. The result follows from Proposition 2, the l2 distributional stability of kernel-based regularization algorithms (Theorem 1), and the bound of Proposition 4 on the l2 distance between the distributions γ̂'/m and γ/m.

Comparing this bound for ε' = 0 with the l2 bound of Theorem 2, we first note that B and B' are essentially related modulo the constant Pr[s = 1], which is not included in the cluster-based reweighting. Thus, the cluster-based convergence is of the order O(λmax^{1/2}(K) B² √(log m' / (p0 n m))) and the KMM convergence of the order O(cond^{1/2}(K) B'/√m). Taking the ratio of the former over the latter and noticing p0 ≥ O(1/B), we obtain the expression O(√(λmin(K) B log m' / n)). Thus, for n > λmin(K) B log m', the convergence of the cluster-based bound is more favorable, while for other values the KMM bound converges faster.

5 Experimental Results

In this section, we will compare the performance of the cluster-based reweighting technique and the KMM technique empirically. We will first discuss and analyze the properties of the clustering method and our particular implementation.

The analysis of Section 4.1 deals with discrete points possibly resulting from the use of a quantization or clustering technique. However, due to the relatively small size of the public training sets available, clustering could leave us with few cluster representatives to train with. Instead, in our experiments, we only used the clusters to estimate sampling probabilities and applied these weights to the full set of training points. As the following proposition shows, the l1 and l2 distance bounds of Proposition 3 do not change significantly so long as the cluster size is roughly uniform and the sampling probability is the same for all points within a cluster. We will refer to this as the clustering assumption. In what follows, let Pr[s = 1 | Ci] designate the sampling probability for all x ∈ Ci. Finally, define q(Ci) = Pr[s = 1 | Ci] and q̂(Ci) = |Ci ∩ S| / |Ci ∩ U|.

Proposition 5. Let B = max_{i=1,...,k} max(1/q(Ci), 1/q̂(Ci)). Then, the l1 and l2 distances of the distributions W and Ŵ can be bounded as follows,

l1(W, Ŵ) ≤ B² √(|CM| k (log 2k + log(1/δ)) / (q0 n m))  and  l2(W, Ŵ) ≤ B² √(|CM| k (log 2k + log(1/δ)) / (q0 n m²)),

where q0 = min_i q(Ci) and |CM| = max_i |Ci|.

Proof. By definition of the l2 distance,

l2²(W, Ŵ) = (1/m²) Σ_{i=1}^k Σ_{x∈Ci} (1/q̂(Ci) − 1/q(Ci))² ≤ (B⁴ |CM| / m²) Σ_{i=1}^k (q(Ci) − q̂(Ci))²
    ≤ (B⁴ |CM| k / m²) max_i (q(Ci) − q̂(Ci))² ≤ B⁴ |CM| k (log 2k + log(1/δ)) / (q0 n m²).

The right-hand side of the first line follows from the clustering assumption, and the following inequalities then follow from exactly the same steps as in Proposition 3, factoring away the sum over the elements of Ci. Finally, the max_i (q(Ci) − q̂(Ci)) term can be bounded just as in Lemma 3 using a uniform convergence bound, however now the union bound is taken over the clusters rather than the unique points. Note that when the cluster size is uniform, then |CM| k = m, and the bound above leads to an expression similar to that of Proposition 3.

We used the leaves of a decision tree to define the clusters. A decision tree selects binary cuts on the coordinates of x ∈ X that greedily minimize a node impurity measure, e.g., MSE for regression (Breiman et al., 1984). Points with similar features and labels are clustered together in this way, with the assumption that these will also have similar sampling probabilities.

Several methods for bias correction are compared in Table 1. Each method assigns corrective weights to the training samples. The unweighted method uses weight 1 for every training instance. The ideal method uses weight 1/Pr[s = 1 | x], which is optimal but requires the sampling distribution to be known.

Table 1. Normalized mean squared error (NMSE) for various regression data sets using unweighted, ideal, clustered and kernel-mean-matched training sample reweightings.

DATASET      |U|     |S|    ntest  UNWEIGHTED    IDEAL         CLUSTERED     KMM
ABALONE      2000    724    2177   .654 ± .019   .551 ± .032   .623 ± .034   .709 ± .122
BANK32NH     4500    2384   3693   .903 ± .022   .610 ± .044   .635 ± .046   .691 ± .055
BANK8FM      4499    1998   3693   .085 ± .003   .058 ± .001   .068 ± .002   .079 ± .013
CAL-HOUSING  16512   9511   4128   .395 ± .010   .360 ± .009   .375 ± .010   .595 ± .054
CPU-ACT      4000    2400   4192   .673 ± .014   .523 ± .080   .568 ± .018   .518 ± .237
CPU-SMALL    4000    2368   4192   .682 ± .053   .477 ± .097   .408 ± .071   .531 ± .280
HOUSING      300     116    206    .509 ± .049   .390 ± .053   .482 ± .042   .469 ± .148
KIN8NM       5000    2510   3192   .594 ± .008   .523 ± .045   .574 ± .018   .704 ± .068
PUMA8NH      4499    2246   3693   .685 ± .013   .674 ± .019   .641 ± .012   .903 ± .059

The clustered method uses weight |Ci ∩ U| / |Ci ∩ S|, where the clusters Ci are regression tree leaves with a minimum count of 4 (larger cluster sizes showed similar, though declining, performance). The KMM method uses the approach of Huang et al. (2006b) with a Gaussian kernel and parameters σ = √d/2 for x ∈ R^d, B = 1000, ε = 0. Note that we know of no principled way to do cross-validation with KMM since it cannot produce weights for a held-out set (Sugiyama et al., 2008).
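A minimal sketch of this cluster-based weighting using scikit-learn follows; it is our own illustrative code rather than the authors' implementation, and the function name and the guard against empty leaf counts are assumptions. It mirrors the description above: the leaves of a regression tree grown on the biased training sample define the clusters, and each training point receives the weight |Ci ∩ U| / |Ci ∩ S| of its leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cluster_weights(X_train, y_train, X_unlab, min_leaf=4):
    """Weight each training point by |C_i ∩ U| / |C_i ∩ S|, where the clusters C_i
    are the leaves of a regression tree grown on the biased training sample."""
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X_train, y_train)
    leaf_train = tree.apply(X_train)      # leaf id of each training point (its cluster)
    leaf_unlab = tree.apply(X_unlab)      # leaf id of each unlabeled point
    weights = np.empty(len(X_train))
    for leaf in np.unique(leaf_train):
        in_S = leaf_train == leaf
        in_U = np.sum(leaf_unlab == leaf)
        weights[in_S] = max(in_U, 1) / np.sum(in_S)   # |C_i ∩ U| / |C_i ∩ S|
    return weights
```

In the setting of the paper, S is a subsample of U, so every leaf containing training points also contains unlabeled points and the weights are at least 1.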

The regression data sets are from LIAAD⁴ and are sampled with P[s = 1 | x] = e^v / (1 + e^v), where v = 4 w·(x − x̄) / σ_{w·(x−x̄)}, x ∈ R^d, w ∈ R^d is chosen at random from [−1, 1]^d, and σ_{w·(x−x̄)} denotes the standard deviation of w·(x − x̄). In our experiments, we chose ten random projections w and reported results with the w, for each data set, that maximizes the difference between the unweighted and ideal methods over repeated sampling trials. In this way, we selected bias samplings that are good candidates for bias correction estimation.
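Biased training sets can be generated along the lines of the sampling scheme just described; the following Python sketch is an illustrative reconstruction, and the random seed and the exact centering and scaling conventions are assumptions on our part.

```python
import numpy as np

def biased_subsample(X, y, seed=0):
    """Draw a biased training subsample with P[s = 1 | x] = e^v / (1 + e^v),
    where v = 4 w.(x - x_bar) / std(w.(x - x_bar)) and w is uniform in [-1, 1]^d."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = rng.uniform(-1.0, 1.0, size=d)
    proj = (X - X.mean(axis=0)) @ w
    v = 4.0 * proj / proj.std()
    p_select = 1.0 / (1.0 + np.exp(-v))    # numerically stable form of e^v / (1 + e^v)
    keep = rng.random(len(X)) < p_select
    return X[keep], y[keep], p_select
```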
For our experiments, we used a version of SVR available from LibSVM⁵ that can take as input weighted samples, with parameter values C = 1 and ε = 0.1, combined with a Gaussian kernel with parameter σ = √d/2. We report results using the normalized mean squared error (NMSE):

NMSE = (1/ntest) Σ_{i=1}^{ntest} (yi − ŷi)² / σ_y²,

and provide mean and standard deviations for ten-fold cross-validation.
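As a rough stand-in for the weighted LibSVM SVR used here, the sketch below uses scikit-learn's SVR, which also accepts per-sample weights; the parameter values and the NMSE helper follow the description above, but this is illustrative code rather than the authors' actual setup.

```python
import numpy as np
from sklearn.svm import SVR

def nmse(y_true, y_pred):
    """Normalized mean squared error: mean squared error divided by the label variance."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def weighted_svr_nmse(X_train, y_train, weights, X_test, y_test):
    d = X_train.shape[1]
    sigma = np.sqrt(d) / 2.0
    model = SVR(C=1.0, epsilon=0.1, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
    model.fit(X_train, y_train, sample_weight=weights)   # weight-sensitive training
    return nmse(y_test, model.predict(X_test))
```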

Our results show that reweighting with more reliable counts, due to clustering, can be effective in the problem of sample bias correction. These results also confirm the dependence that our theoretical bounds exhibit on the quantity n0. The results obtained using KMM seem to be consistent with those reported by the authors of this technique.⁶

6 Conclusion

We presented a general analysis of sample selection bias correction and gave bounds analyzing the effect of an estimation error on the accuracy of the hypotheses returned. The notion of distributional stability and the techniques presented are general and can be of independent interest for the analysis of learning algorithms in other settings. In particular, these techniques apply similarly to other importance weighting algorithms and can be used in other contexts such as that of learning in the presence of uncertain labels. The analysis of the discriminative method of (Bickel et al., 2007) for the problem of covariate shift could perhaps also benefit from this study.

⁴ www.liaad.up.pt/~ltorgo/Regression/DataSets.html.
⁵ www.csie.ntu.edu.tw/~cjlin/libsvmtools.
⁶ We thank Arthur Gretton for discussion and help in clarifying the choice of the parameters and design of the KMM experiments reported in (Huang et al., 2006b), and for providing the code used by the authors for comparison studies.

References

Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. ICML 2007 (pp. 81-88).
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499-526.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.
Cortes, C., & Vapnik, V. N. (1995). Support-Vector Networks. Machine Learning, 20, 273-297.
Devroye, L., & Wagner, T. (1979). Distribution-free performance bounds for potential function rules. IEEE Trans. on Information Theory (pp. 601-604).
Dudík, M., Schapire, R. E., & Phillips, S. J. (2006). Correcting sample selection bias in maximum entropy density estimation. NIPS 2005.
Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973-978).
Fan, W., Davidson, I., Zadrozny, B., & Yu, P. S. (2005). An improved categorization of classifier's sensitivity on sample selection bias. ICDM 2005 (pp. 605-608). IEEE Computer Society.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47, 151-161.
Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schölkopf, B. (2006a). Correcting Sample Selection Bias by Unlabeled Data (Technical Report CS-2006-44). University of Waterloo.
Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2006b). Correcting sample selection bias by unlabeled data. NIPS 2006 (pp. 601-608).
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. COLT 1997 (pp. 152-162).
Little, R. J. A., & Rubin, D. B. (1986). Statistical analysis with missing data. New York, NY, USA: John Wiley & Sons, Inc.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge Regression Learning Algorithm in Dual Variables. ICML 1998 (pp. 515-521).
Steinwart, I. (2002). On the influence of the kernel on the consistency of support vector machines. JMLR, 2, 67-93.
Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2008.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. ICML 2004.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. ICDM 2003.

Proof of Theorem 3

Proof. Assume that µ(p) = µ(q) for two probability distributions p and q in P. It is known that if E_{x∼p}[f(x)] = E_{x∼q}[f(x)] for any f ∈ C(X), then p = q. Let f ∈ C(X) and fix ε > 0. Since K is universal, there exists a function g induced by K such that ‖f − g‖∞ ≤ ε. E_{x∼p}[f(x)] − E_{x∼q}[f(x)] can be rewritten as

E_{x∼p}[f(x)] − E_{x∼q}[f(x)] = E_{x∼p}[f(x) − g(x)] + E_{x∼p}[g(x)] − E_{x∼q}[g(x)] + E_{x∼q}[g(x) − f(x)].    (36)

Since |E_{x∼p}[f(x) − g(x)]| ≤ E_{x∼p}|f(x) − g(x)| ≤ ‖f − g‖∞ ≤ ε, and similarly |E_{x∼q}[g(x) − f(x)]| ≤ ε,

|E_{x∼p}[f(x)] − E_{x∼q}[f(x)]| ≤ |E_{x∼p}[g(x)] − E_{x∼q}[g(x)]| + 2ε.    (37)

Since g is induced by K, there exists w ∈ F such that for all x ∈ X, g(x) = ⟨w, Φ(x)⟩. Since F is separable, it admits a countable orthonormal basis (en)_{n∈N}. For n ∈ N, let wn = ⟨w, en⟩ and Φn(x) = ⟨Φ(x), en⟩. Then, g(x) = Σ_{n=0}^∞ wn Φn(x). For each N ∈ N, consider the partial sum gN(x) = Σ_{n=0}^N wn Φn(x). By the Cauchy-Schwarz inequality,

|gN(x)| ≤ (Σ_{n=0}^N wn²)^{1/2} (Σ_{n=0}^N Φn(x)²)^{1/2} ≤ ‖w‖2 ‖Φ(x)‖2.    (38)

Since K is universal, it is continuous and thus Φ is also continuous (Steinwart, 2002). Thus x ↦ ‖Φ(x)‖2 is a continuous function over the compact X and admits an upper bound B0. Thus, |gN(x)| ≤ ‖w‖2 B0. The integral ∫ ‖w‖2 B0 dp is clearly well defined and equals ‖w‖2 B0. Thus, by the Lebesgue dominated convergence theorem, the following holds:

E_{x∼p}[g(x)] = ∫ Σ_{n=0}^∞ wn Φn(x) dp(x) = Σ_{n=0}^∞ wn ∫ Φn(x) dp(x).    (39)

By definition of E_{x∼p}[Φ(x)], the last term is the inner product of w and that term. Thus,

E_{x∼p}[g(x)] = ⟨w, E_{x∼p}[Φ(x)]⟩ = ⟨w, µ(p)⟩.    (40)

A similar equality holds with the distribution q, thus,

E_{x∼p}[g(x)] − E_{x∼q}[g(x)] = ⟨w, µ(p) − µ(q)⟩ = 0.    (41)

Thus, Inequality 37 can be rewritten as |E_{x∼p}[f(x)] − E_{x∼q}[f(x)]| ≤ 2ε, for all ε > 0. This implies E_{x∼p}[f(x)] = E_{x∼q}[f(x)] for all f ∈ C(X) and the injectivity of µ.
