Machine Learning, Tue Apr 10, 2007
Motivation

Problem: To avoid relying on magic we need mathematics. For machine learning we need to quantify:
- Uncertainty in data measures and conclusions
- Goodness of model (when confronted with data)
- Expected error and expected success rates
- ... and many similar quantities ...

Probability theory: Mathematical modeling when uncertainty or randomness is present.

Statistics: The mathematics of collection of data, description of data, and inference from data.

p(X = xi, Y = yj) = pij, estimated from counts as nij / n
Introduction to probability theory

Notice: This will be an informal introduction to probability theory (measure theory is out of scope for this course). No sigma algebras, Borel sets, etc.

This introduction will be based on stochastic (random) variables.

For the purpose of this class, our intuition will be right ... in more complex settings it can be very wrong. We leave the complex setups to the mathematicians and stick to nice models.

We ignore the underlying probability space (Ω, A, p).

If X is the sum of two dice: X(ω) = D1(ω) + D2(ω). We ignore the dice and only consider the variable X.
Discrete random variables

A discrete random variable, X, is a variable that can take values in a discrete (countable) set {xi}.

The probability of X taking the value xi is denoted p(X = xi) and satisfies p(X = xi) ≥ 0 for all i, Σi p(X = xi) = 1, and, for any subset {xj} ⊆ {xi}: p(X ∈ {xj}) = Σj p(xj).

We often simplify the notation and use both p(X) and p(xi) for p(X = xi), depending on context.

Intuition/interpretation: If we repeat an experiment (sampling a value for X) n times, and denote by ni the number of times we observe X = xi, then ni/n → p(X = xi) as n → ∞.

This is the intuition, not a definition! (Definitions based on this end up going in circles.) The definitions are pure abstract math. Any real-world usefulness is pure luck.

Sect. 1.2
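The frequency intuition ni/n → p(X = xi) can be checked by simulation. A minimal sketch using the two-dice example from the earlier slide (the code itself is not from the slides):

```python
import random

def estimate_p7(n, seed=0):
    """Estimate p(X = 7) for X = D1 + D2, the sum of two fair dice, from n samples."""
    rng = random.Random(seed)
    # n_i: how often we observe X = 7 in n repeated experiments.
    hits = sum(1 for _ in range(n)
               if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / n  # n_i / n, which tends to p(X = x_i) as n grows

# Exactly 6 of the 36 equally likely outcomes sum to 7, so p(X = 7) = 1/6.
print(estimate_p7(100_000))  # close to 1/6
```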
Jointprobability
Ifarandomvariable, Z,isavector, Z=(X,Y),wecan consideritscomponentsseparetly. Theprobability p(Z=z)where z = (x,y)isthejointprobability of X=xand Y=ywritten p(X=x,Y=y)or p(x,y).
Marginal probability

The probability of X = xi regardless of the value of Y then becomes Σj p(xi, yj) and is denoted the marginal probability of X, written just p(xi).

The sum rule: p(X) = ΣY p(X, Y)   (1.10)

Sect. 1.2
Conditional probability

The conditional probability of X given Y is written p(X|Y) and is the quantity satisfying p(X, Y) = p(X|Y) p(Y).

The product rule: p(X, Y) = p(X|Y) p(Y)   (1.11)

When p(Y) ≠ 0 we get p(X|Y) = p(X, Y) / p(Y), with a simple interpretation.

Intuition: Before we observe anything, the probability of X is p(X), but after we observe Y it becomes p(X|Y).

Sect. 1.2
Independence
When p(X,Y) = p(X)p(Y)wesaythat XandYare independent. Inthiscase:
Sect.1.2
Example

B: colour of bucket; F: kind of fruit.

p(F=a, B=r) = p(F=a|B=r) p(B=r) = 2/8 · 4/10 = 1/10
p(F=a, B=b) = p(F=a|B=b) p(B=b) = 3/4 · 6/10 = 9/20
p(F=a) = p(F=a, B=r) + p(F=a, B=b) = 1/10 + 9/20 = 11/20

Sect. 1.2
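The arithmetic above can be reproduced with a small joint-probability table built via the product rule, then marginalised with the sum rule. A sketch using the bucket numbers from the example (the `fractions` module keeps the arithmetic exact):

```python
from fractions import Fraction as F

# Priors over buckets and conditionals p(F | B), from the example.
p_B = {'r': F(4, 10), 'b': F(6, 10)}
p_F_given_B = {('a', 'r'): F(2, 8), ('o', 'r'): F(6, 8),
               ('a', 'b'): F(3, 4), ('o', 'b'): F(1, 4)}

# Product rule: p(F, B) = p(F | B) p(B)
joint = {(f, b): p_F_given_B[(f, b)] * p_B[b] for (f, b) in p_F_given_B}

# Sum rule: marginal p(F = a) = sum over b of p(F = a, B = b)
p_a = joint[('a', 'r')] + joint[('a', 'b')]
print(joint[('a', 'r')], joint[('a', 'b')], p_a)  # 1/10 9/20 11/20
```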
Bayes' theorem

Since p(X, Y) = p(Y, X) (symmetry) and p(X, Y) = p(Y|X) p(X) (product rule), it follows that p(Y|X) p(X) = p(X|Y) p(Y), or, when p(X) ≠ 0:

Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)   (1.12)

Interpretation: Prior to an experiment, the probability of Y is p(Y). After observing X, the probability is p(Y|X). Bayes' theorem tells us how to move from prior to posterior.

This is possibly the most important equation in the entire class!

Sometimes written: p(Y|X) ∝ p(X|Y) p(Y), where p(X) = ΣY p(X|Y) p(Y) is an implicit normalising factor.

Sect. 1.2
Example

B: colour of bucket; F: kind of fruit.

If we draw an orange, what is the probability we drew it from the blue bucket?

Sect. 1.2
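This question is answered by plugging the bucket numbers from the earlier example slide into Bayes' theorem; a sketch:

```python
from fractions import Fraction as F

p_B = {'r': F(4, 10), 'b': F(6, 10)}        # prior over buckets
p_o_given_B = {'r': F(6, 8), 'b': F(1, 4)}  # p(F=o | B), from the example

# Implicit normalising factor: p(F=o) = sum over b of p(F=o | B=b) p(B=b)
p_o = sum(p_o_given_B[b] * p_B[b] for b in p_B)

# Bayes: p(B=b | F=o) = p(F=o | B=b) p(B=b) / p(F=o)
posterior_b = p_o_given_B['b'] * p_B['b'] / p_o
print(posterior_b)  # 1/3
```

The prior for the blue bucket, 6/10, drops to a posterior of 1/3 once the orange is observed, since oranges are much more common in the red bucket.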
Continuousrandomvariables
Acontinuousrandomvariable, X,isavariablethatcan takevaluesin Rd.
Sect.1.2.1
Expectation
Theexpectationormeanofafunction fofrandomvariable Xisaweightedaverage
Expectation
Intuition:Ifyourepeatedlyplayagamewithgain f(x),your expectedoverallgainafter ngameswillbe n E[f]. Theaccuracyofthispredictionincreaseswith n.
Itmightnotevenbepossibletogain E[f]inasinglegame.
Sect.1.2.2
Expectation
Intuition:Ifyourepeatedlyplayagamewithgain f(x),your expectedoverallgainafter ngameswillbe n E[f]. Theaccuracyofthispredictionincreaseswith n.
Sect.1.2.2
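The game intuition can be sketched in a few lines. The game here (roll one die, gain its face value) is a hypothetical example, not from the slides; note that its E[f] = 3.5 cannot be gained in any single round:

```python
import random

def total_gain(n, seed=1):
    """Play n rounds of a toy game: roll one fair die, gain f(x) = x."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n))

# E[f] = sum_x p(x) f(x) = (1 + 2 + ... + 6) / 6 = 3.5,
# so the overall gain after n rounds is approximately n * E[f].
n = 100_000
print(total_gain(n) / n)  # close to 3.5
```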
Variance
Thevarianceof f(x)isdefinedas andcanbeseenasameasureofvariabilityaroundthemean. Thecovarianceof Xand Yisdefinedas andmeasuresthevariabilityofthetwovariablestogether.
Sect.1.2.2
Variance
Thevarianceof f(x)isdefinedas andcanbeseenasameasureofvariabilityaroundthemean. Thecovarianceof Xand Yisdefinedas andmeasuresthevariabilityofthetwovariablestogether. Whencov[x,y] > 0,whenxisabovemean, ytendstobe. Whencov [x,y]<0,whenxisabovemean, ytendstobebelow. Whencov [x,y]=0, xand yareuncorrelated(notnecessarily independent;independeceimpliesuncorrelated,though).
Sect.1.2.2
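The sign interpretation of covariance can be illustrated by estimating cov[x, y] from samples. A sketch with an assumed toy dependency (y = x + noise, so the covariance is positive and equals var[x] = 1):

```python
import random

def sample_cov(n, seed=2):
    """Estimate cov[x, y] = E[(x - E[x])(y - E[y])] from n samples."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)  # y tends above its mean when x is
        xs.append(x)
        ys.append(y)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

print(sample_cov(100_000))  # close to 1.0: positive covariance
```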
Covariance
cov [x1,x2]>0
cov [x1,x2]=0
Sect.1.2.2
Parameterizeddistributions
Manydistributionsare governedbyafew parameters. E.g.cointossing (Bernoullydistribution) governedbythe probabilityofheads. Binomialdistribution: numberofheads kout of ncointosses:
Parameterizeddistributions
Manydistributionsare Wecanthinkofaparameterizeddistributionasa governedbyafew conditionaldistribution. parameters. Thefunction x p ( x | q ) isthe probability of E.g.cointossing observation xgivenparameter q. (Bernoullydistribution) governedbythe probabilityofheads. Thefunction q p(x | q)isthelikelihoodof
Binomialdistribution: parameter qgivenobservation x.Sometimeswritten numberofheads kout lhd(q | x) = p(x | q). of ncointosses:
Parameterizeddistributions
Manydistributionsare Wecanthinkofaparameterizeddistributionasa governedbyafew conditionaldistribution. parameters. Thefunction x p ( x | q ) isthe probability of E.g.cointossing observation xgivenparameter q. (Bernoullydistribution) governedbythe probabilityofheads. Thefunction q p(x | q)isthelikelihoodof
Binomialdistribution: parameter qgivenobservation x.Sometimeswritten numberofheads kout lhd(q | x) = p(x | q). of ncointosses: Thelikelihood,ingeneral,isnotaprobability distribution.
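The last point can be checked numerically for the one-toss coin model: summing p(x | θ) over the outcomes x gives 1, but integrating the likelihood θ ↦ p(heads | θ) = θ over θ does not. A sketch:

```python
# One coin toss: p(heads | theta) = theta, p(tails | theta) = 1 - theta.
theta = 0.3
assert abs(theta + (1 - theta) - 1.0) < 1e-12  # over x it is a distribution

# Likelihood of theta given one observed head: lhd(theta | x) = theta.
# Midpoint-rule integral of theta over [0, 1]: the area is 1/2, not 1.
steps = 100_000
area = sum((i + 0.5) / steps for i in range(steps)) / steps
print(area)  # approximately 0.5: not a distribution over theta
```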
Parameter estimation

Generally, parameters are not known but must be estimated from observed data.

Maximum Likelihood (ML): θ̂ = argmaxθ p(x | θ)
Maximum A Posteriori (MAP): θ̂ = argmaxθ p(θ | x) (a Bayesian approach assuming a distribution over parameters).
Fully Bayesian: p(θ | x) (estimates a distribution rather than a parameter).
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. Likelihood:
Prior:
Posterior:
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. Likelihood: MLestimate Prior:
Posterior:
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. Likelihood:
Prior:
Posterior:
MAPestimate
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. FullyBayesianapproach: Likelihood:
Prior:
Posterior:
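The concrete prior and posterior formulas did not survive extraction. One common choice (an assumption here, not necessarily the slides') is a Beta(a, b) prior, which is conjugate to the binomial likelihood, so the posterior is again a Beta and all three estimates fall out in closed form:

```python
# Assumed Beta(a, b) prior for theta; one toss, one head: k = 1, n = 1.
a, b = 2.0, 2.0  # hypothetical hyperparameters, chosen for illustration
k, n = 1, 1

# Conjugacy: posterior is Beta(a + k, b + n - k).
a_post, b_post = a + k, b + n - k

ml_est = k / n                                   # maximises p(x | theta): 1
map_est = (a_post - 1) / (a_post + b_post - 2)   # mode of the posterior: 2/3
posterior_mean = a_post / (a_post + b_post)      # Bayesian point summary: 0.6
print(ml_est, map_est, posterior_mean)
```

Note how the prior pulls the MAP estimate away from the rather extreme ML estimate θ̂ = 1 obtained from a single toss.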
Predictions
Assumenowknownjointdistribution p(x,t | q)ofexplanatory p(t | x, q)tomakepredictionsabout t. variable xandtargetvariable t.Whenobservingnew xwecanuse
Decisiontheory
Basedon p(x,t | q)weoftenneedtomakedecisions. Thisoftenmeanstakingoneofasmallsetofactions A1,A2,...,Ak basedonobserved x.
Assumethatthetargetvariableisinthisset,thenwemake decisionsbasedon p(t | x, q) = p( Ai | x, q). Putinadifferentway:weuse p(x,t | q)toclassify xintooneof k classes, Ci.
Sect.1.5
Decisiontheory
Wecanapproachthisbysplittingtheinputintoregions, Ri,and makedecisionsbasedonthese: In R1gofor C1;in R2gofor C2. Chooseregionstominimize classificationerrors:
Sect.1.5
Decisiontheory
Wecanapproachthisbysplittingtheinputintoregions, Ri,and makedecisionsbasedonthese: In R1gofor C1;in R2gofor C2. Chooseregionstominimize classificationerrors:
Decisiontheory
Wecanapproachthisbysplittingtheinputintoregions, Ri,and makedecisionsbasedonthese: In R1gofor C1;in R2gofor C2. Chooseregionstominimize classificationerrors: x0iswhere p(x, C1) = p(x,C2)orsimilarly
Sect.1.5
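The region-based rule can be sketched for two hypothetical classes with Gaussian class-conditional densities and equal priors (all of these modeling choices are assumptions for illustration):

```python
import math

def p_joint(x, cls):
    """p(x, C) = p(x | C) p(C), with C1 ~ N(0, 1), C2 ~ N(2, 1), equal priors."""
    mu = 0.0 if cls == 1 else 2.0
    prior = 0.5
    return prior * math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def decide(x):
    """Minimise p(mistake): assign x to the class with the larger p(x, C)."""
    return 1 if p_joint(x, 1) >= p_joint(x, 2) else 2

# The boundary x0 solves p(x, C1) = p(x, C2); by symmetry it is x0 = 1 here,
# so R1 = (-inf, 1] and R2 = (1, inf).
print(decide(0.5), decide(1.5))  # 1 2
```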
Model selection

Where do we get p(t, x | θ) from in the first place?

There is no right model: a fair coin or a fair die is as unrealistic as a spherical cow!

Sometimes there are obvious candidates to try, either for the joint or conditional probabilities p(x, t | θ) or p(t | x, θ). Sometimes we can try a "generic" model: linear models, neural networks, ...

This is the topic of most of this class!

But some models are more useful than others. If we have several models, how do we measure the usefulness of each? A good measure is prediction accuracy on new data.

Sect. 1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach:
oraBayesianapproach:
justasforparameters.
Sect.1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach: Butthereisanoverfittingproblem: Complexmodelsoftenfittraining databetterwithoutgeneralizing oraBayesianapproach: better! justasforparameters.
Sect.1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach: Butthereisanoverfittingproblem: Complexmodelsoftenfittraining databetterwithoutgeneralizing oraBayesianapproach: better! InBayesianapproach,use p(M)topenalize complexmodels justasforparameters. InMLapproach,usesomeInformationCriteria andmaximize ln p(t,x |M) penalty( M). Sect.1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach: Butthereisanoverfittingproblem: Complexmodelsoftenfittraining Ormoreempiricalapproach:Use databetterwithoutgeneralizing oraBayesianapproach: somemethodofsplittingdatainto better! trainingdataandtestdataandpick modelthatperformsbestontestdata. justasforparameters. (andretrainthatmodelwiththefull dataset).
Sect.1.3
Summary

- Probabilities
- Stochastic variables
- Marginal and conditional probabilities
- Bayes' theorem
- Expectation, variance and covariance
- Estimation
- Decision theory and model selection