Machine Learning, Tue Apr 10, 2007
Motivation

Problem: To avoid relying on magic we need mathematics. For machine learning we need to quantify:
- Uncertainty in data measures and conclusions
- Goodness of model (when confronted with data)
- Expected error and expected success rates
- ... and many similar quantities ...

Probability theory: Mathematical modeling when uncertainty or randomness is present.

Statistics: The mathematics of collection of data, description of data, and inference from data.

p(X = xi, Y = yj) = pij, estimated from counts as nij / n
Introduction to probability theory

Notice: This will be an informal introduction to probability theory (measure theory is out of scope for this course). No sigma algebras, Borel sets, etc.

This introduction will be based on stochastic (random) variables.

For the purpose of this class, our intuition will be right ... in more complex settings it can be very wrong. We leave the complex setups to the mathematicians and stick to nice models.

We ignore the underlying probability space (Ω, A, p).

If X is the sum of two dice: X(ω) = D1(ω) + D2(ω). We ignore the dice and only consider the variable X.
Discrete random variables

A discrete random variable, X, is a variable that can take values in a discrete (countable) set {xi}.

The probability of X taking the value xi is denoted p(X = xi) and satisfies p(X = xi) ≥ 0 for all i, Σi p(X = xi) = 1, and, for any subset {xj} ⊆ {xi}: p(X ∈ {xj}) = Σj p(xj).

We often simplify the notation and use both p(X) and p(xi) for p(X = xi), depending on context.

Intuition/interpretation: If we repeat an experiment (sampling a value for X) n times, and denote by ni the number of times we observe X = xi, then ni/n → p(X = xi) as n → ∞.

This is the intuition, not a definition! (Definitions based on this end up going in circles.) The definitions are pure abstract math. Any real-world usefulness is pure luck.

Sect. 1.2
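The frequency intuition ni/n → p(X = xi) can be checked by simulation. A minimal sketch using the two-dice example from the earlier slide (the code itself is not from the slides):

```python
import random

def estimate_p7(n, seed=0):
    """Estimate p(X = 7) for X = D1 + D2, the sum of two fair dice, from n samples."""
    rng = random.Random(seed)
    # n_i: how often we observe X = 7 in n repeated experiments.
    hits = sum(1 for _ in range(n)
               if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / n  # n_i / n, which tends to p(X = x_i) as n grows

# Exactly 6 of the 36 equally likely outcomes sum to 7, so p(X = 7) = 1/6.
print(estimate_p7(100_000))  # close to 1/6
```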
Jointprobability
Ifarandomvariable, Z,isavector, Z=(X,Y),wecan consideritscomponentsseparetly. Theprobability p(Z=z)where z = (x,y)isthejointprobability of X=xand Y=ywritten p(X=x,Y=y)or p(x,y).
Marginal probability

The probability of X = xi regardless of the value of Y then becomes Σj p(xi, yj) and is denoted the marginal probability of X, written just p(xi).

The sum rule: p(X) = ΣY p(X, Y)   (1.10)

Sect. 1.2
Conditional probability

The conditional probability of X given Y is written p(X|Y) and is the quantity satisfying p(X, Y) = p(X|Y) p(Y).

The product rule: p(X, Y) = p(X|Y) p(Y)   (1.11)

When p(Y) ≠ 0 we get p(X|Y) = p(X, Y) / p(Y), with a simple interpretation.

Intuition: Before we observe anything, the probability of X is p(X), but after we observe Y it becomes p(X|Y).

Sect. 1.2
Independence
When p(X,Y) = p(X)p(Y)wesaythat XandYare independent. Inthiscase:
Sect.1.2
Example

B: colour of bucket; F: kind of fruit.

p(F=a, B=r) = p(F=a|B=r) p(B=r) = 2/8 · 4/10 = 1/10
p(F=a, B=b) = p(F=a|B=b) p(B=b) = 3/4 · 6/10 = 9/20
p(F=a) = p(F=a, B=r) + p(F=a, B=b) = 1/10 + 9/20 = 11/20

Sect. 1.2
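The arithmetic above can be reproduced with a small joint-probability table built via the product rule, then marginalised with the sum rule. A sketch using the bucket numbers from the example (the `fractions` module keeps the arithmetic exact):

```python
from fractions import Fraction as F

# Priors over buckets and conditionals p(F | B), from the example.
p_B = {'r': F(4, 10), 'b': F(6, 10)}
p_F_given_B = {('a', 'r'): F(2, 8), ('o', 'r'): F(6, 8),
               ('a', 'b'): F(3, 4), ('o', 'b'): F(1, 4)}

# Product rule: p(F, B) = p(F | B) p(B)
joint = {(f, b): p_F_given_B[(f, b)] * p_B[b] for (f, b) in p_F_given_B}

# Sum rule: marginal p(F = a) = sum over b of p(F = a, B = b)
p_a = joint[('a', 'r')] + joint[('a', 'b')]
print(joint[('a', 'r')], joint[('a', 'b')], p_a)  # 1/10 9/20 11/20
```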
Bayes' theorem

Since p(X, Y) = p(Y, X) (symmetry) and p(X, Y) = p(Y|X) p(X) (product rule), it follows that p(Y|X) p(X) = p(X|Y) p(Y), or, when p(X) ≠ 0:

Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)   (1.12)

Interpretation: Prior to an experiment, the probability of Y is p(Y). After observing X, the probability is p(Y|X). Bayes' theorem tells us how to move from prior to posterior.

This is possibly the most important equation in the entire class!

Sometimes written: p(Y|X) ∝ p(X|Y) p(Y), where p(X) = ΣY p(X|Y) p(Y) is an implicit normalising factor.

Sect. 1.2
Example

B: colour of bucket; F: kind of fruit.

If we draw an orange, what is the probability we drew it from the blue bucket?

Sect. 1.2
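This question is answered by plugging the bucket numbers from the earlier example slide into Bayes' theorem; a sketch:

```python
from fractions import Fraction as F

p_B = {'r': F(4, 10), 'b': F(6, 10)}        # prior over buckets
p_o_given_B = {'r': F(6, 8), 'b': F(1, 4)}  # p(F=o | B), from the example

# Implicit normalising factor: p(F=o) = sum over b of p(F=o | B=b) p(B=b)
p_o = sum(p_o_given_B[b] * p_B[b] for b in p_B)

# Bayes: p(B=b | F=o) = p(F=o | B=b) p(B=b) / p(F=o)
posterior_b = p_o_given_B['b'] * p_B['b'] / p_o
print(posterior_b)  # 1/3
```

The prior for the blue bucket, 6/10, drops to a posterior of 1/3 once the orange is observed, since oranges are much more common in the red bucket.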
Continuousrandomvariables
Acontinuousrandomvariable, X,isavariablethatcan takevaluesin Rd.
Sect.1.2.1
Expectation
Theexpectationormeanofafunction fofrandomvariable Xisaweightedaverage
Expectation
Intuition:Ifyourepeatedlyplayagamewithgain f(x),your expectedoverallgainafter ngameswillbe n E[f]. Theaccuracyofthispredictionincreaseswith n.
Itmightnotevenbepossibletogain E[f]inasinglegame.
Sect.1.2.2
Expectation
Intuition:Ifyourepeatedlyplayagamewithgain f(x),your expectedoverallgainafter ngameswillbe n E[f]. Theaccuracyofthispredictionincreaseswith n.
Sect.1.2.2
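The game intuition can be sketched in a few lines. The game here (roll one die, gain its face value) is a hypothetical example, not from the slides; note that its E[f] = 3.5 cannot be gained in any single round:

```python
import random

def total_gain(n, seed=1):
    """Play n rounds of a toy game: roll one fair die, gain f(x) = x."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n))

# E[f] = sum_x p(x) f(x) = (1 + 2 + ... + 6) / 6 = 3.5,
# so the overall gain after n rounds is approximately n * E[f].
n = 100_000
print(total_gain(n) / n)  # close to 3.5
```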
Variance
Thevarianceof f(x)isdefinedas andcanbeseenasameasureofvariabilityaroundthemean. Thecovarianceof Xand Yisdefinedas andmeasuresthevariabilityofthetwovariablestogether.
Sect.1.2.2
Variance
Thevarianceof f(x)isdefinedas andcanbeseenasameasureofvariabilityaroundthemean. Thecovarianceof Xand Yisdefinedas andmeasuresthevariabilityofthetwovariablestogether. Whencov[x,y] > 0,whenxisabovemean, ytendstobe. Whencov [x,y]<0,whenxisabovemean, ytendstobebelow. Whencov [x,y]=0, xand yareuncorrelated(notnecessarily independent;independeceimpliesuncorrelated,though).
Sect.1.2.2
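The sign interpretation of covariance can be illustrated by estimating cov[x, y] from samples. A sketch with an assumed toy dependency (y = x + noise, so the covariance is positive and equals var[x] = 1):

```python
import random

def sample_cov(n, seed=2):
    """Estimate cov[x, y] = E[(x - E[x])(y - E[y])] from n samples."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)  # y tends above its mean when x is
        xs.append(x)
        ys.append(y)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

print(sample_cov(100_000))  # close to 1.0: positive covariance
```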
Covariance
cov [x1,x2]>0
cov [x1,x2]=0
Sect.1.2.2
Parameterizeddistributions
Manydistributionsare governedbyafew parameters. E.g.cointossing (Bernoullydistribution) governedbythe probabilityofheads. Binomialdistribution: numberofheads kout of ncointosses:
Parameterizeddistributions
Manydistributionsare Wecanthinkofaparameterizeddistributionasa governedbyafew conditionaldistribution. parameters. Thefunction x p ( x | q ) isthe probability of E.g.cointossing observation xgivenparameter q. (Bernoullydistribution) governedbythe probabilityofheads. Thefunction q p(x | q)isthelikelihoodof
Binomialdistribution: parameter qgivenobservation x.Sometimeswritten numberofheads kout lhd(q | x) = p(x | q). of ncointosses:
Parameterizeddistributions
Manydistributionsare Wecanthinkofaparameterizeddistributionasa governedbyafew conditionaldistribution. parameters. Thefunction x p ( x | q ) isthe probability of E.g.cointossing observation xgivenparameter q. (Bernoullydistribution) governedbythe probabilityofheads. Thefunction q p(x | q)isthelikelihoodof
Binomialdistribution: parameter qgivenobservation x.Sometimeswritten numberofheads kout lhd(q | x) = p(x | q). of ncointosses: Thelikelihood,ingeneral,isnotaprobability distribution.
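The last point can be checked numerically for the one-toss coin model: summing p(x | θ) over the outcomes x gives 1, but integrating the likelihood θ ↦ p(heads | θ) = θ over θ does not. A sketch:

```python
# One coin toss: p(heads | theta) = theta, p(tails | theta) = 1 - theta.
theta = 0.3
assert abs(theta + (1 - theta) - 1.0) < 1e-12  # over x it is a distribution

# Likelihood of theta given one observed head: lhd(theta | x) = theta.
# Midpoint-rule integral of theta over [0, 1]: the area is 1/2, not 1.
steps = 100_000
area = sum((i + 0.5) / steps for i in range(steps)) / steps
print(area)  # approximately 0.5: not a distribution over theta
```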
Parameter estimation

Generally, parameters are not known but must be estimated from observed data.

Maximum Likelihood (ML): θ̂ = argmaxθ p(x | θ)
Maximum A Posteriori (MAP): θ̂ = argmaxθ p(θ | x) (a Bayesian approach assuming a distribution over parameters).
Fully Bayesian: p(θ | x) (estimates a distribution rather than a parameter).
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. Likelihood:
Prior:
Posterior:
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. Likelihood: MLestimate Prior:
Posterior:
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. Likelihood:
Prior:
Posterior:
MAPestimate
Parameterestimation
Example:Wetossacoinandgetahead.Ourmodelisa binomialdistribution; xisoneheadand qtheprobabilityofa head. FullyBayesianapproach: Likelihood:
Prior:
Posterior:
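The concrete prior and posterior formulas did not survive extraction. One common choice (an assumption here, not necessarily the slides') is a Beta(a, b) prior, which is conjugate to the binomial likelihood, so the posterior is again a Beta and all three estimates fall out in closed form:

```python
# Assumed Beta(a, b) prior for theta; one toss, one head: k = 1, n = 1.
a, b = 2.0, 2.0  # hypothetical hyperparameters, chosen for illustration
k, n = 1, 1

# Conjugacy: posterior is Beta(a + k, b + n - k).
a_post, b_post = a + k, b + n - k

ml_est = k / n                                   # maximises p(x | theta): 1
map_est = (a_post - 1) / (a_post + b_post - 2)   # mode of the posterior: 2/3
posterior_mean = a_post / (a_post + b_post)      # Bayesian point summary: 0.6
print(ml_est, map_est, posterior_mean)
```

Note how the prior pulls the MAP estimate away from the rather extreme ML estimate θ̂ = 1 obtained from a single toss.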
Predictions
Assumenowknownjointdistribution p(x,t | q)ofexplanatory p(t | x, q)tomakepredictionsabout t. variable xandtargetvariable t.Whenobservingnew xwecanuse
Decisiontheory
Basedon p(x,t | q)weoftenneedtomakedecisions. Thisoftenmeanstakingoneofasmallsetofactions A1,A2,...,Ak basedonobserved x.
Assumethatthetargetvariableisinthisset,thenwemake decisionsbasedon p(t | x, q) = p( Ai | x, q). Putinadifferentway:weuse p(x,t | q)toclassify xintooneof k classes, Ci.
Sect.1.5
Decisiontheory
Wecanapproachthisbysplittingtheinputintoregions, Ri,and makedecisionsbasedonthese: In R1gofor C1;in R2gofor C2. Chooseregionstominimize classificationerrors:
Sect.1.5
Decisiontheory
Wecanapproachthisbysplittingtheinputintoregions, Ri,and makedecisionsbasedonthese: In R1gofor C1;in R2gofor C2. Chooseregionstominimize classificationerrors:
Decisiontheory
Wecanapproachthisbysplittingtheinputintoregions, Ri,and makedecisionsbasedonthese: In R1gofor C1;in R2gofor C2. Chooseregionstominimize classificationerrors: x0iswhere p(x, C1) = p(x,C2)orsimilarly
Sect.1.5
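The region-based rule can be sketched for two hypothetical classes with Gaussian class-conditional densities and equal priors (all of these modeling choices are assumptions for illustration):

```python
import math

def p_joint(x, cls):
    """p(x, C) = p(x | C) p(C), with C1 ~ N(0, 1), C2 ~ N(2, 1), equal priors."""
    mu = 0.0 if cls == 1 else 2.0
    prior = 0.5
    return prior * math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def decide(x):
    """Minimise p(mistake): assign x to the class with the larger p(x, C)."""
    return 1 if p_joint(x, 1) >= p_joint(x, 2) else 2

# The boundary x0 solves p(x, C1) = p(x, C2); by symmetry it is x0 = 1 here,
# so R1 = (-inf, 1] and R2 = (1, inf).
print(decide(0.5), decide(1.5))  # 1 2
```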
Model selection

Where do we get p(t, x | θ) from in the first place?

There is no right model: a fair coin or a fair die is as unrealistic as a spherical cow!

Sometimes there are obvious candidates to try, either for the joint or conditional probabilities p(x, t | θ) or p(t | x, θ). Sometimes we can try a "generic" model: linear models, neural networks, ...

This is the topic of most of this class!

But some models are more useful than others. If we have several models, how do we measure the usefulness of each? A good measure is prediction accuracy on new data.

Sect. 1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach:
oraBayesianapproach:
justasforparameters.
Sect.1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach: Butthereisanoverfittingproblem: Complexmodelsoftenfittraining databetterwithoutgeneralizing oraBayesianapproach: better! justasforparameters.
Sect.1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach: Butthereisanoverfittingproblem: Complexmodelsoftenfittraining databetterwithoutgeneralizing oraBayesianapproach: better! InBayesianapproach,use p(M)topenalize complexmodels justasforparameters. InMLapproach,usesomeInformationCriteria andmaximize ln p(t,x |M) penalty( M). Sect.1.3
Modelselection
Ifwecomparetwomodels,wecantakeamaximumlikelihood approach: Butthereisanoverfittingproblem: Complexmodelsoftenfittraining Ormoreempiricalapproach:Use databetterwithoutgeneralizing oraBayesianapproach: somemethodofsplittingdatainto better! trainingdataandtestdataandpick modelthatperformsbestontestdata. justasforparameters. (andretrainthatmodelwiththefull dataset).
Sect.1.3
Summary

- Probabilities
- Stochastic variables
- Marginal and conditional probabilities
- Bayes' theorem
- Expectation, variance and covariance
- Estimation
- Decision theory and model selection