
Crash course in probability theory and statistics, part 1

Machine Learning, Tue Apr 10, 2007

Motivation

Problem: To avoid relying on magic we need mathematics. For machine learning we need to quantify:
- uncertainty in data, measurements and conclusions
- goodness of a model (when confronted with data)
- expected error and expected success rates
- ... and many similar quantities ...

Probability theory: mathematical modeling when uncertainty or randomness is present.

Statistics: the mathematics of the collection of data, the description of data, and inference from data.

[Slide illustration: a table of counts n_ij with p(X=x_i, Y=y_j) = n_ij / n.]

Introduction to probability theory

Notice: This will be an informal introduction to probability theory (measure theory is out of scope for this course). No sigma algebras, Borel sets, etc. We leave the complex setups to the mathematicians and stick to nice models.

For the purposes of this class, our intuition will be right ... in more complex settings it can be very wrong.

This introduction will be based on stochastic (random) variables. We ignore the underlying probability space (Ω, A, P).

If X is the sum of two dice: X(ω) = D1(ω) + D2(ω). We ignore the dice and only consider the variables X, D1, and D2 and the values they take.

Discrete random variables

A discrete random variable, X, is a variable that can take values in a discrete (countable) set {x_i}.

The probability of X taking the value x_i is denoted p(X=x_i) and satisfies p(X=x_i) ≥ 0 for all i, Σ_i p(X=x_i) = 1, and for any subset {x_j} ⊆ {x_i}: p(X ∈ {x_j}) = Σ_j p(x_j).

Intuition/interpretation: If we repeat an experiment (sampling a value for X) n times, and denote by n_i the number of times we observe X=x_i, then n_i/n → p(X=x_i) as n → ∞.

This is the intuition, not a definition! (Definitions based on this end up going in circles.) The definitions are pure abstract math; any real-world usefulness is pure luck.

We often simplify the notation and use both p(X) and p(x_i) for p(X=x_i), depending on context.

Sect. 1.2
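The frequency interpretation is easy to check by simulation. Below is a minimal sketch in Python (the two-dice variable is from the earlier slide; the code itself is illustrative):

```python
import random
from collections import Counter

random.seed(0)

# Exact distribution of the sum of two fair dice:
# p(X=s) = (number of pairs (d1, d2) with d1 + d2 = s) / 36.
exact = {s: c / 36
         for s, c in Counter(d1 + d2
                             for d1 in range(1, 7)
                             for d2 in range(1, 7)).items()}

# Empirical frequencies n_i / n approach p(X=x_i) as n grows.
for n in (100, 10_000, 1_000_000):
    counts = Counter(random.randint(1, 6) + random.randint(1, 6)
                     for _ in range(n))
    print(f"n = {n:>9}: n_7/n = {counts[7] / n:.4f}"
          f"   (exact p(X=7) = {exact[7]:.4f})")
```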

Joint probability

If a random variable, Z, is a vector, Z = (X, Y), we can consider its components separately.

The probability p(Z=z), where z = (x, y), is the joint probability of X=x and Y=y, written p(X=x, Y=y) or p(x, y).

When clear from context, we write just p(X, Y) or p(x, y), and the notation is symmetric: p(X, Y) = p(Y, X) and p(x, y) = p(y, x).

The probability of X ∈ {x_i} and Y ∈ {y_j} becomes Σ_i Σ_j p(x_i, y_j).

Sect. 1.2

Marginal probability

The probability of X=x_i regardless of the value of Y becomes Σ_j p(x_i, y_j); it is denoted the marginal probability of X and is written just p(x_i).

The sum rule: p(X) = Σ_Y p(X, Y)    (1.10)

Sect. 1.2
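As an illustrative sketch, the sum rule on a small made-up joint table:

```python
# A tiny joint distribution p(X, Y) as a dict; probabilities sum to 1.
# (The numbers are made up for illustration.)
p_xy = {("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
        ("x2", "y1"): 0.25, ("x2", "y2"): 0.35}

# Sum rule (1.10): p(X=x) = sum over y of p(X=x, Y=y).
def p_x(x):
    return sum(p for (xi, _), p in p_xy.items() if xi == x)

print(p_x("x1"), p_x("x2"))  # 0.4 0.6 (up to float rounding)
```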

Conditional probability

The conditional probability of X given Y is written p(X|Y) and is the quantity satisfying p(X, Y) = p(X|Y) p(Y).

The product rule: p(X, Y) = p(X|Y) p(Y)    (1.11)

When p(Y) ≠ 0 we get p(X|Y) = p(X, Y) / p(Y), with a simple interpretation.

Intuition: Before we observe anything, the probability of X is p(X), but after we observe Y it becomes p(X|Y).

Sect. 1.2
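And the product rule, rearranged to compute a conditional from the same kind of made-up table:

```python
# Product rule (1.11) rearranged: p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y).
# Same made-up joint distribution as in the previous sketch.
p_xy = {("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
        ("x2", "y1"): 0.25, ("x2", "y2"): 0.35}

def p_x_given_y(x, y):
    p_y = sum(p for (_, yj), p in p_xy.items() if yj == y)  # marginal p(Y=y)
    return p_xy[(x, y)] / p_y  # requires p(Y=y) != 0

print(p_x_given_y("x1", "y1"))  # 0.10 / 0.35 = 0.286 (approx.)
```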

Independence

When p(X, Y) = p(X) p(Y) we say that X and Y are independent. In this case: p(X|Y) = p(X).

Intuition/justification: Observing Y does not change the probability of X.

Sect. 1.2

Example

B - colour of bucket, F - kind of fruit.

p(F=a, B=r) = p(F=a|B=r) p(B=r) = 2/8 · 4/10 = 1/10
p(F=a, B=b) = p(F=a|B=b) p(B=b) = 3/4 · 6/10 = 9/20
p(F=a) = p(F=a, B=r) + p(F=a, B=b) = 1/10 + 9/20 = 11/20

Sect. 1.2

Bayes' theorem

Since p(X, Y) = p(Y, X) (symmetry) and p(X, Y) = p(Y|X) p(X) (product rule), it follows that p(Y|X) p(X) = p(X|Y) p(Y), or, when p(X) ≠ 0:

Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)    (1.12)

Here p(Y|X) is the posterior of Y, p(Y) is the prior of Y, and p(X|Y) is the likelihood of Y.

Sometimes written: p(Y|X) ∝ p(X|Y) p(Y), where p(X) = Σ_Y p(X|Y) p(Y) is an implicit normalising factor.

Interpretation: Prior to an experiment, the probability of Y is p(Y). After observing X, the probability is p(Y|X). Bayes' theorem tells us how to move from prior to posterior.

This is possibly the most important equation in the entire class!

Sect. 1.2

Example

B - colour of bucket, F - kind of fruit.

If we draw an orange, what is the probability we drew it from the blue bucket?

Sect. 1.2
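A small sketch answering the question with Bayes' theorem, using the numbers from the example above (exact fractions to avoid rounding):

```python
from fractions import Fraction as Fr

# From the example: p(B=r) = 4/10, p(B=b) = 6/10,
# p(F=a|B=r) = 2/8, p(F=a|B=b) = 3/4, so p(F=o|B) = 1 - p(F=a|B).
p_B = {"r": Fr(4, 10), "b": Fr(6, 10)}
p_o_given_B = {"r": 1 - Fr(2, 8), "b": 1 - Fr(3, 4)}

# Normalising factor: p(F=o) = sum over B of p(F=o|B) p(B).
p_o = sum(p_o_given_B[b] * p_B[b] for b in p_B)

# Posterior: p(B=b | F=o) = p(F=o|B=b) p(B=b) / p(F=o).
print(p_o_given_B["b"] * p_B["b"] / p_o)  # 1/3
```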

Continuous random variables

A continuous random variable, X, is a variable that can take values in R^d.

The probability density of X is an integrable function p(x) satisfying p(x) ≥ 0 for all x and ∫ p(x) dx = 1.

The probability of X ∈ S ⊆ R^d is given by p(S) = ∫_S p(x) dx.

Sect. 1.2.1
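A minimal numerical sketch of p(S) = ∫_S p(x) dx, assuming a standard normal density for illustration:

```python
import math

# Integrate the standard normal density over S = [-1, 1] with the
# midpoint rule, and compare with the exact value via the error function.
def p(x):  # density of N(0, 1)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

a, b, n = -1.0, 1.0, 10_000
h = (b - a) / n
approx = h * sum(p(a + (i + 0.5) * h) for i in range(n))

exact = math.erf(1 / math.sqrt(2))  # P(-1 <= X <= 1) for X ~ N(0, 1)
print(approx, exact)  # both approx. 0.6827
```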

Expectation

The expectation or mean of a function f of a random variable X is a weighted average: E[f] = Σ_x p(x) f(x) in the discrete case, E[f] = ∫ p(x) f(x) dx in the continuous case.

For both discrete and continuous random variables:
E[f] ≈ (1/N) Σ_{n=1..N} f(x_n)    (1.35)
as N → ∞ when x_n ~ p(X).

Intuition: If you repeatedly play a game with gain f(x), your expected overall gain after n games will be about n E[f]. The accuracy of this prediction increases with n.

It might not even be possible to gain E[f] in a single game. Example: a game of dice with a fair die, D the value of the die, and gain function f(d) = d; then E[f] = (1+2+3+4+5+6)/6 = 3.5, which no single roll can yield.

Sect. 1.2.2
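A quick simulation of estimate (1.35) for the dice game (illustrative sketch):

```python
import random

random.seed(1)

# Monte Carlo estimate (1.35) of E[f] for the dice game f(d) = d.
# The exact value is (1+2+3+4+5+6)/6 = 3.5, unreachable in a single game.
def estimate(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(f"N = {n:>7}: estimate of E[f] = {estimate(n):.3f}")
```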

Variance

The variance of f(x) is defined as
var[f] = E[(f(x) - E[f(x)])^2]
and can be seen as a measure of variability around the mean.

The covariance of X and Y is defined as
cov[x, y] = E[(x - E[x]) (y - E[y])]
and measures the variability of the two variables together.

When cov[x, y] > 0: when x is above its mean, y tends to be above its mean too. When cov[x, y] < 0: when x is above its mean, y tends to be below its mean. When cov[x, y] = 0: x and y are uncorrelated (not necessarily independent; independence implies uncorrelated, though).

Sect. 1.2.2

Covariance

[Slide figure: two scatter plots, one with cov[x1, x2] > 0 and one with cov[x1, x2] = 0.]

Sect. 1.2.2
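A sketch of the sample covariance on simulated data, showing the cov > 0 and cov ≈ 0 cases from the figure (the data-generating choices are illustrative):

```python
import random

random.seed(2)

# Sample covariance: cov[x, y] is estimated by
# (1/N) * sum of (x_n - mean(x)) * (y_n - mean(y)).
def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]

ys_dependent = [x + 0.5 * e for x, e in zip(xs, noise)]  # rises with x: cov > 0
ys_independent = noise                                   # unrelated to x: cov near 0

print(cov(xs, ys_dependent), cov(xs, ys_independent))  # approx. 1.0 and 0.0
```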

Parameterized distributions

Many distributions are governed by a few parameters. E.g. coin tossing (the Bernoulli distribution) is governed by the probability of heads.

Binomial distribution: the number of heads k out of n coin tosses:
p(k | n, θ) = C(n, k) θ^k (1 - θ)^(n-k)

We can think of a parameterized distribution as a conditional distribution.

The function x ↦ p(x | θ) is the probability of observation x given parameter θ.

The function θ ↦ p(x | θ) is the likelihood of parameter θ given observation x. Sometimes written lhd(θ | x) = p(x | θ).

The likelihood, in general, is not a probability distribution.
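A small sketch contrasting the two readings of p(k | n, θ): as a distribution over k it sums to 1, while as a likelihood over θ it does not:

```python
from math import comb

# Binomial distribution: p(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k).
def binom(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# As a function of k (theta fixed) it is a probability distribution ...
print(sum(binom(k, 10, 0.3) for k in range(11)))  # 1.0

# ... but as a function of theta (the likelihood of theta given k = 3 heads)
# it integrates over theta in [0, 1] to 1/11, not 1: not a distribution in theta.
m = 10_000
print(sum(binom(3, 10, (i + 0.5) / m) for i in range(m)) / m)  # approx. 0.0909
```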

Parameter estimation

Generally, parameters are not known but must be estimated from observed data.

Maximum Likelihood (ML): θ* = argmax_θ p(x | θ)

Maximum A Posteriori (MAP): θ* = argmax_θ p(θ | x) (a Bayesian approach assuming a distribution over parameters).

Fully Bayesian: p(θ | x) (estimates a distribution rather than a parameter).

Parameter estimation

Example: We toss a coin and get a head. Our model is a binomial distribution; x is one head and θ the probability of a head.

Likelihood: lhd(θ | x) = p(x | θ) = θ, maximized at the ML estimate θ* = 1.

Prior: p(θ).

Posterior: p(θ | x) ∝ p(x | θ) p(θ), maximized at the MAP estimate.

Fully Bayesian approach: report the whole posterior p(θ | x) rather than a single estimate.

[Slide figures: plots of the likelihood, prior, and posterior over θ.]
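A sketch comparing the three estimates for this example. The slides leave the prior unspecified, so a Beta(2, 2) prior is assumed here purely for illustration:

```python
# ML, MAP and fully Bayesian estimates for the one-toss example.
# Assumption: a Beta(2, 2) prior on theta. With h heads and t tails the
# likelihood is theta^h (1-theta)^t, and the posterior is Beta(2+h, 2+t).
h, t = 1, 0  # one toss, one head

# ML: argmax of theta^h (1-theta)^t over [0, 1]; for h=1, t=0 this is 1.
ml = h / (h + t)

# MAP: mode of the Beta(2+h, 2+t) posterior.
a, b = 2 + h, 2 + t
map_est = (a - 1) / (a + b - 2)

# Fully Bayesian: keep the whole posterior; summarize e.g. by its mean.
posterior_mean = a / (a + b)

print(ml, map_est, posterior_mean)  # 1.0, 0.667 (approx.), 0.6
```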

Predictions

Assume now a known joint distribution p(x, t | θ) of explanatory variable x and target variable t. When observing a new x we can use p(t | x, θ) to make predictions about t.

Decision theory

Based on p(x, t | θ) we often need to make decisions. This often means taking one of a small set of actions A1, A2, ..., Ak based on an observed x.

Assume that the target variable is in this set; then we make decisions based on p(t | x, θ) = p(A_i | x, θ).

Put in a different way: we use p(x, t | θ) to classify x into one of k classes, C_i.

Sect. 1.5

Decision theory

We can approach this by splitting the input into regions, R_i, and making decisions based on these: in R1 go for C1; in R2 go for C2. Choose the regions to minimize classification errors.

[Slide figure: the joint densities p(x, C1) and p(x, C2) with a decision boundary. The red and green areas misclassify C2 as C1; the blue area misclassifies C1 as C2. At x0 the red area is gone and p(mistake) is minimized.]

x0 is where p(x, C1) = p(x, C2), or equivalently p(C1 | x) p(x) = p(C2 | x) p(x), so we get the intuitively pleasing rule: assign x to the class with the largest posterior p(C_i | x).

Sect. 1.5
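A minimal sketch of this decision rule, with two assumed Gaussian class densities and equal priors (all numbers illustrative):

```python
import math

# Minimum-error rule: at each x, pick the class k with the largest
# joint p(x, C_k) = p(C_k) p(x | C_k).
def normal(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

prior = {1: 0.5, 2: 0.5}      # assumed equal class priors
mean = {1: 0.0, 2: 2.0}       # assumed class-conditional means, unit variance

def decide(x):
    joint = {k: prior[k] * normal(x, mean[k], 1.0) for k in (1, 2)}
    return max(joint, key=joint.get)

print([decide(x) for x in (-1.0, 0.9, 1.1, 3.0)])  # [1, 1, 2, 2]; here x0 = 1.0
```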

Model selection

Where do we get p(t, x | θ) from in the first place?

There is no right model: a fair coin or a fair die is as unrealistic as a spherical cow!

Sometimes there are obvious candidates to try, either for the joint or the conditional probabilities, p(x, t | θ) or p(t | x, θ). Sometimes we can try a "generic" model: linear models, neural networks, ...

This is the topic of most of this class!

Sect. 1.3

Model selection

There is no right model, but some models are more useful than others.

If we have several models, how do we measure the usefulness of each?

A good measure is prediction accuracy on new data.

Sect. 1.3

Model selection

If we compare two models, we can take a maximum likelihood approach:
M* = argmax_M p(t, x | M)

or a Bayesian approach:
p(M | t, x) ∝ p(t, x | M) p(M)

just as for parameters.

But there is an overfitting problem: complex models often fit the training data better without generalizing better!

In the Bayesian approach, use p(M) to penalize complex models. In the ML approach, use some information criterion and maximize ln p(t, x | M) - penalty(M).

Or a more empirical approach: use some method of splitting the data into training data and test data, and pick the model that performs best on the test data (and retrain that model with the full data set). A sketch of this follows below.

Sect. 1.3
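A sketch of the empirical train/test approach on toy data (the noisy-sine data set and polynomial models are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical model selection: split into train/test, fit polynomials of
# increasing degree, keep the degree with the lowest test error.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

idx = rng.permutation(len(x))
train, test = idx[:30], idx[30:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train], y[train], degree)  # fit on training data only
    mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree}: test MSE = {mse:.3f}")
# Degree 1 underfits; very high degrees fit the training data better
# without necessarily generalizing better.
```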

Summary

Probabilities
Stochastic variables
Marginal and conditional probabilities
Bayes' theorem
Expectation, variance and covariance
Estimation
Decision theory and model selection
