
Kernel Density Estimation (KDE) Tutorial - Spider Financial Corp, 2013

Kernel Density Estimation (KDE)
Previously, we've seen how to use the histogram method to infer the probability density function (PDF)
of a random variable (population) using a finite data sample. In this tutorial, we'll continue with the
problem of probability density function inference, but using another method: kernel density estimation.
Kernel density estimates (KDE) are closely related to histograms, but can be endowed with properties
such as smoothness or continuity by using a suitable kernel.
Why do we care?
One of the main problems in practical applications is that the needed probability distribution is usually
not readily available, but rather it must be derived from other existing information (e.g. sample data).
Like histograms, KDEs are a non-parametric method, so there are no restrictive assumptions about the
shape of the density function, but KDE is far superior to histograms in terms of accuracy and
continuity.
Overview
Let's consider a finite data sample $\{x_1, x_2, \ldots, x_N\}$ observed from a stochastic (i.e. continuous and
random) process. We wish to infer the population probability density function.
In the histogram method, we select the left bound of the histogram ($x_o$), the bin's width ($h$), and then
compute the bin $k$ probability estimator $\hat{f}_h(k)$:
- Bin $k$ represents the following interval: $[x_o + (k-1)h,\; x_o + kh)$
- $$\hat{f}_h(k) = \frac{\sum_{i=1}^{N} I\{(k-1)h \le x_i - x_o < kh\}}{N}$$
- $I\{\cdot\}$ is an event function that returns 1 (one) if the condition is true, 0 (zero) otherwise.
The choice of bins, especially the bin width ($h$), has a substantial effect on the shape and other
properties of $\hat{f}_h(k)$.
Finally, we can think of the histogram method as follows:
- Each observation (event) is statistically independent of all others, and its occurrence probability
is equal to $\frac{1}{N}$.
- $\hat{f}_h(k)$ is simply the integral (sum) of the event probabilities in each bin.
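The bin estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_probabilities`, the `n_bins` argument and the sample values are hypothetical and do not come from NumXL.

```python
import math

def histogram_probabilities(sample, x0, h, n_bins):
    """Estimate the probability of each bin k = 1..n_bins.

    Bin k covers the half-open interval [x0 + (k-1)*h, x0 + k*h).
    Each observation contributes 1/N to the bin it falls in.
    """
    n = len(sample)
    counts = [0] * n_bins
    for x in sample:
        k = math.floor((x - x0) / h)  # zero-based bin index
        if 0 <= k < n_bins:           # ignore values outside the bins
            counts[k] += 1
    return [c / n for c in counts]

# Hypothetical data, used only to show the effect of the bin width h.
sample = [0.1, 0.4, 0.5, 1.2, 1.9, 2.5, 2.6, 3.3]
wide = histogram_probabilities(sample, x0=0.0, h=2.0, n_bins=2)
narrow = histogram_probabilities(sample, x0=0.0, h=1.0, n_bins=4)
print(wide)
print(narrow)
```

Running it with the two bin widths gives visibly different shapes for the same data, which is the sensitivity to $h$ noted above.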


What is a kernel?
A kernel is a non-negative, real-valued, integrable function $K(\cdot)$ satisfying the following two
requirements:

$$\int_{-\infty}^{\infty} K(u)\,du = 1$$
$$K(-u) = K(u) \quad \text{for all } u$$

And, as a result, the scaled function $K^*(u)$, where $K^*(u) = \lambda K(\lambda u)$ with $\lambda > 0$, is a kernel as well.
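A quick numerical check of the two kernel requirements may help here. The sketch below assumes the scaled form $K^*(u) = \lambda K(\lambda u)$ with $\lambda > 0$ and uses the Gaussian density as $K$; the helper names and the value $\lambda = 2.5$ are arbitrary choices for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the base kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def scaled_kernel(u, lam):
    """Assumed scaled form K*(u) = lam * K(lam * u), for lam > 0."""
    return lam * gaussian_kernel(lam * u)

def integrate(f, a, b, steps=20000):
    """Composite trapezoidal rule on [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for j in range(1, steps):
        total += f(a + j * width)
    return total * width

# Requirement 1: the kernel integrates to one (tails beyond +/-10 are negligible).
area_k = integrate(gaussian_kernel, -10.0, 10.0)
area_scaled = integrate(lambda u: scaled_kernel(u, 2.5), -10.0, 10.0)

# Requirement 2: symmetry, K*(-u) = K*(u), checked at a few points.
is_symmetric = all(
    abs(scaled_kernel(u, 2.5) - scaled_kernel(-u, 2.5)) < 1e-15
    for u in (0.1, 0.7, 1.3)
)
print(area_k, area_scaled, is_symmetric)
```

Both areas should come out very close to 1, so scaling changes the width of the kernel but not its total mass.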
Now, place a scaled kernel function at each observation in the sample and compute the new probability
estimator $\hat{f}_h(x)$ for a value $x$ (compared to an earlier bin in the histogram).

$$\hat{f}_h(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i)$$
$$K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)$$
$$\hat{f}_h(x) = \frac{1}{N h} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)$$


As an example, let $K(\cdot)$ be the standardized Gaussian density function. The KDE looks like the sum of
Gaussian curves, each centered on one observation.

Note: For a Gaussian kernel, the bandwidth $h$ is the same as the standard deviation of $(x - x_i)$,
i.e. of each Gaussian curve centered on an observation.
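The fixed-bandwidth estimator $\hat{f}_h(x) = \frac{1}{Nh}\sum_{i} K\!\left(\frac{x - x_i}{h}\right)$ translates almost directly into code. The following is a minimal sketch with a Gaussian kernel; the sample values, the bandwidth and the evaluation grid are made up for illustration, and this is not how NumXL computes its output.

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """f_h(x) = 1/(N*h) * sum_i K((x - x_i) / h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

# Hypothetical sample and bandwidth.
sample = [-1.2, -0.4, 0.0, 0.3, 1.5]
h = 0.5

# Evaluate the estimate on an evenly spaced grid from -6 to 6.
step = 0.01
grid = [-6.0 + step * j for j in range(1201)]
density = [kde(x, sample, h) for x in grid]

# Riemann-sum check that the estimate integrates to (about) one.
area = sum(density) * step
print(round(area, 4))
```

Each observation contributes one Gaussian bump of width $h$, and the estimate is their average, so it is smooth and integrates to one like a proper density.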


The KDE method replaces the discrete probability

$$P(x) = \begin{cases} 1/N & x \in \{x_1, x_2, \ldots, x_N\} \\ 0 & x \notin \{x_1, x_2, \ldots, x_N\} \end{cases}$$

with a kernel function. This permits overlap between kernels, thus promoting continuity in the
probability estimator.
Why KDE?
Due to our data sampling, we are left with a finite set of values for continuous random variables. By using a
kernel instead of discrete probabilities, we preserve the continuous nature of the underlying random
variable.
To proceed with KDE, you'll need to decide on two key parameters: the kernel function and the bandwidth.
Which kernel should I use?
A range of kernel functions are commonly used: uniform, triangular, biweight, triweight and
Epanechnikov. The Gaussian kernel is often used; $K(\cdot) = \phi(\cdot)$, where $\phi$ is the standard normal density
function.
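For reference, the kernels named above can be written out using their standard textbook definitions. The sketch below is an assumption-laden illustration: the function names and the `KERNELS` dictionary are hypothetical, and the midpoint-rule check only confirms that each function integrates to one.

```python
import math

def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0

def triangular(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def biweight(u):
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0

def triweight(u):
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1 else 0.0

def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

KERNELS = {
    "uniform": uniform,
    "triangular": triangular,
    "epanechnikov": epanechnikov,
    "biweight": biweight,
    "triweight": triweight,
    "gaussian": gaussian,
}

def area(kernel, lo=-8.0, hi=8.0, steps=16000):
    """Midpoint-rule approximation of the integral of the kernel."""
    width = (hi - lo) / steps
    return sum(kernel(lo + (j + 0.5) * width) for j in range(steps)) * width

areas = {name: area(k) for name, k in KERNELS.items()}
for name, value in areas.items():
    print(name, round(value, 4))
```

All kernels except the Gaussian have bounded support on $[-1, 1]$, so an observation only influences the estimate within one bandwidth of its location.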
How do I properly compute kernel bandwidth?
Intuitively, one wants to choose an $h$ as small as the data allows, but there is a trade-off between the
bias of the estimator and its variance.
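One rough way to see this trade-off is to count the bumps (local maxima) of the estimate for a very small and a very large bandwidth. The sketch below uses a Gaussian kernel; the five sample values, the two bandwidths and the helper `count_modes` are hypothetical choices made only for illustration.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Fixed-bandwidth Gaussian KDE evaluated at x."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def count_modes(sample, h, lo, hi, steps=2000):
    """Count strict local maxima of the KDE on an evenly spaced grid."""
    xs = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    ys = [kde(x, sample, h) for x in xs]
    return sum(1 for j in range(1, steps) if ys[j - 1] < ys[j] > ys[j + 1])

# Hypothetical, evenly spaced observations.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]

modes_small_h = count_modes(sample, h=0.1, lo=-4.0, hi=4.0)  # under-smoothed
modes_large_h = count_modes(sample, h=2.0, lo=-4.0, hi=4.0)  # over-smoothed
print(modes_small_h, modes_large_h)
```

With the small bandwidth the estimate follows every observation (low bias, high variance); with the large one the bumps merge into a single broad curve (low variance, high bias).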
Selection of the bandwidth of a kernel estimator is a subject of considerable research. We will outline
two popular methods and mention a third:
1. Subjective selection: One can experiment by using different bandwidths and simply selecting
one that looks right for the type of data under investigation.
2. Selection with reference to some given distribution: Here one selects the bandwidth that would
be optimal for a particular PDF. Keep in mind that you are not assuming that $f(x)$ is normal, but
rather selecting an $h$ which would be optimal if the PDF were normal. Using a Gaussian kernel,
the optimal bandwidth $h_{opt}$ is defined as follows:

$$h_{opt} = \sigma \sqrt[5]{\frac{4}{3N}}$$
The normal distribution is not a "wiggly" distribution; it is unimodal and bell-shaped. It is
therefore to be expected that $h_{opt}$ will be too large for multimodal distributions. Furthermore,
the sample variance ($s^2$) is not a robust estimator of $\sigma^2$; it overestimates if some outliers
(extreme observations) are present. To overcome these problems, Silverman proposed the
following bandwidth estimator:


$$h_{opt} = \frac{0.9\,\hat{\sigma}}{\sqrt[5]{N}}$$
$$\hat{\sigma} = \min\!\left(s, \frac{R}{1.34}\right)$$
$$R = IQR = Q_3 - Q_1$$

Where $IQR$ is the interquartile range and $s$ is the sample standard deviation.


3. Data-driven estimation: this is an area of current research using several different methods:
Fourier transform, diffusion-based, etc.
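The two closed-form rules in method 2 can be sketched as follows. This is an illustration under stated assumptions: $\sigma$ is estimated by the sample standard deviation $s$ in the normal-reference rule, the quartiles come from Python's `statistics.quantiles` (whose default convention may differ from the one NumXL uses), and the data values are made up.

```python
import statistics

def normal_reference_bandwidth(sample):
    """h_opt = sigma * (4 / (3N))**(1/5), with sigma estimated by s."""
    n = len(sample)
    s = statistics.stdev(sample)
    return s * (4.0 / (3.0 * n)) ** 0.2

def silverman_bandwidth(sample):
    """h_opt = 0.9 * min(s, IQR / 1.34) * N**(-1/5)."""
    n = len(sample)
    s = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    sigma_hat = min(s, (q3 - q1) / 1.34)
    return 0.9 * sigma_hat * n ** -0.2

# Hypothetical data: a well-behaved sample, then the same sample plus one outlier.
clean = [-1.8, -1.1, -0.7, -0.3, 0.0, 0.2, 0.6, 0.9, 1.4, 2.0]
with_outlier = clean + [50.0]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(label,
          round(normal_reference_bandwidth(data), 3),
          round(silverman_bandwidth(data), 3))
```

The single extreme value inflates $s$ and therefore the normal-reference bandwidth, while the Silverman estimate falls back on the interquartile range and moves much less.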
Process
Using the NumXL add-in for Excel, you can compute the KDE values for different kernel functions (e.g.
Gaussian, uniform, triangular, etc.) and (optionally) with a bandwidth value.
For our sample data, we are using 50 randomly generated values of the normal distribution (using the
random generator in the Excel Analysis Pack). We plotted the histogram for our reference:

Now we are ready to construct our KDE plot. First, select the empty cell in your worksheet where you
wish the output table to be generated, then locate and click on the "Descriptive Statistics" icon in the
NumXL tab (or toolbar). Then, select the "Kernel density estimation" item from the drop-down menu.


The KDE wizard appears.

Select the cells range for the values of the input variable.
Notes:
1. The cells range includes (optional) the heading ("Label") cell, which would be used in the output
tables where it references those variables.
2. By default, the output table cells range is set to the current selected cell in your worksheet.
3. By default, the output graph cells range is set to the 7 cells to the right of the currently selected
cell in your worksheet.
Finally, once we select the input data (X) cells range, the "Options" and "Missing Values" tabs become
available (enabled).
Next, select the "Options" tab:


Notes:
1. By default, the Gaussian kernel function is selected. Let's leave this option unchanged.
2. By default, the optimal bandwidth option is checked. The KDE function will use the Silverman
estimate for the bandwidth. Leave it checked.
3. By default, the output table size is set to 5. Leave it unchanged.
4. "Overlay Normal distribution" is checked. This option in effect instructs the wizard to generate a
second curve for the Gaussian distribution for comparison purposes. Leave this option checked.
Now, click on the "Missing Values" tab.

In this tab, you can select an approach to handle missing values in the data set (X's). By default, any
observation with a missing value would be excluded from the analysis.
This treatment is a good approach for our analysis, so let's leave it unchanged.
Now, click "OK" to generate the output tables.

Notes:
1. The values of all X are sorted in ascending order.


2. The summary statistics in the 1st row are computed merely to facilitate the creation of the table
or computing the overlay Gaussian distribution function.
The generated plot of the KDE is shown below:

Note that the KDE curve (blue) tracks very closely with the Gaussian density (orange) curve.
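For readers without Excel at hand, the same experiment can be approximated in plain Python. This is a rough analogue under stated assumptions, not a reproduction of the NumXL wizard: the seed, the grid and all function names are arbitrary, and the quartile convention of `statistics.quantiles` may differ from the one NumXL uses.

```python
import math
import random
import statistics

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Fixed-bandwidth Gaussian KDE evaluated at x."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def silverman_bandwidth(sample):
    """h = 0.9 * min(s, IQR / 1.34) * N**(-1/5)."""
    s = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return 0.9 * min(s, (q3 - q1) / 1.34) * len(sample) ** -0.2

def normal_pdf(x, mu, sigma):
    return gaussian_kernel((x - mu) / sigma) / sigma

# 50 pseudo-random standard normal values (arbitrary seed for repeatability).
rng = random.Random(2013)
sample = [rng.gauss(0.0, 1.0) for _ in range(50)]

h = silverman_bandwidth(sample)
mu, s = statistics.mean(sample), statistics.stdev(sample)

step = 0.05
grid = [-5.0 + step * j for j in range(201)]
kde_curve = [kde(x, sample, h) for x in grid]          # analogue of the KDE curve
overlay = [normal_pdf(x, mu, s) for x in grid]         # analogue of the normal overlay

max_gap = max(abs(a - b) for a, b in zip(kde_curve, overlay))
area = sum(kde_curve) * step
print(round(h, 3), round(max_gap, 3), round(area, 3))
```

`max_gap` gives a crude numerical counterpart to the visual comparison of the two curves; how small it is depends on the particular sample drawn.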
Case 2
Now let's try a non-normal sample data set. We generated 50 random values of a uniform
distribution between -3 and 3. Following similar steps, we plotted the histogram and the KDE:

Note that the KDE curve (blue) tracks much more closely with the underlying distribution (i.e.
uniform) than the histogram.
Case 3
For our 3rd case, we generated 50 random values of a binomial distribution (p = 0.2 and batch
size = 20). Following similar steps, we plotted the histogram and the KDE.



Note that the KDE curve (blue) tracks much more closely with the underlying distribution (i.e. binomial) than
the histogram.
Conclusion
In this tutorial, we demonstrated the process to generate a kernel density estimation in Excel using
NumXL's add-in functions.
The KDE method is a major improvement for inferring the probability density function of the population,
in terms of accuracy and continuity of the function. Nevertheless, it introduces a new challenge:
selecting a proper bandwidth. In the majority of cases, the Silverman estimator for the bandwidth
proves to be satisfactory, but is it optimal? Do we care?
Where do we go from here?
First, to answer the question of optimality, we need to introduce additional algorithms to estimate its
values. For example, in Annals of Statistics, Volume 38, Number 5, pages 2916-2957, Z. I. Botev, J. F.
Grotowski, and D. P. Kroese described a numerical sample data-driven method for finding the optimal
bandwidth using a kernel density estimation via the diffusion approach.
Second, in cases where the range of values that the random variable can take is known to be
constrained from one side (e.g. prices, binomial data, etc.), or in a range (e.g. survival rate, default rate,
etc.), how do we adapt the KDE to factor in those constraints?
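To see why the constraint question matters, one can measure how much probability an unconstrained Gaussian-kernel KDE places outside the allowed range. The sketch below only illustrates the problem, it does not solve it; the non-negative sample, the bandwidth and the helper `mass_below` are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mass_below(bound, sample, h):
    """Probability a Gaussian-kernel KDE assigns to (-inf, bound).

    Each kernel is a normal density centered at x_i with standard
    deviation h, so its mass below the bound is Phi((bound - x_i) / h).
    """
    return sum(normal_cdf((bound - xi) / h) for xi in sample) / len(sample)

# Hypothetical non-negative data (e.g. prices) clustered near zero.
sample = [0.1, 0.2, 0.2, 0.5, 0.9, 1.4, 2.0, 3.1]
leak = mass_below(0.0, sample, h=0.5)
print(round(leak, 3))
```

Any mass reported here sits on negative values that the variable can never take, which is exactly what a constraint-aware estimator would need to correct.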
Finally, we defined the KDE probability estimator using a fixed bandwidth ($h$) for all observations. If the
bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon
estimator) or the samples (pointwise estimator), this produces a particularly powerful method known
as adaptive or variable bandwidth kernel density estimation.
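As a rough sketch of the sample-dependent variant, each observation can be given its own bandwidth $h_i$, so that $\hat{f}(x) = \frac{1}{N}\sum_{i} \frac{1}{h_i} K\!\left(\frac{x - x_i}{h_i}\right)$. How the $h_i$ are chosen is the hard part and is not covered here; the bandwidths below are hand-picked, hypothetical values used only to show the mechanics.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def fixed_kde(x, sample, h):
    """Fixed-bandwidth estimator: one h for all observations."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def sample_point_kde(x, sample, bandwidths):
    """Variable-bandwidth estimator: observation x_i gets its own h_i."""
    n = len(sample)
    return sum(
        gaussian_kernel((x - xi) / hi) / hi
        for xi, hi in zip(sample, bandwidths)
    ) / n

# Hypothetical data: a tight cluster plus one isolated observation.
sample = [-1.0, -0.8, -0.7, 0.0, 3.0]
# Hand-picked bandwidths: narrow in the cluster, wide for the isolated point.
bandwidths = [0.3, 0.3, 0.3, 0.5, 1.2]

step = 0.01
grid = [-10.0 + step * j for j in range(2001)]
area = sum(sample_point_kde(x, sample, bandwidths) for x in grid) * step
print(round(area, 4))
```

When every $h_i$ equals the same $h$, `sample_point_kde` reduces to the fixed-bandwidth estimator defined earlier, and the estimate still integrates to one for any positive choice of $h_i$.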