1.2 What is the "Best Fitting Line"?
Since we are interested in summarizing the trend between two quantitative variables, the natural
question arises: "what is the best fitting line?" At some point in your education, you were probably
shown a scatter plot of (x, y) data and were asked to draw the "most appropriate" line through the
data. Even if you weren't, you can try it now on a set of heights (x) and weights (y) of 10 students
(student_height_weight.txt) [1]. Looking at the plot below, which line, the solid line or the dashed
line, do you think best summarizes the trend between height and weight?
Hold on to your answer! In order to examine which of the two lines is a better fit, we first need to
introduce some common notation:
$y_i$ denotes the observed response for experimental unit i
$x_i$ denotes the predictor value for experimental unit i
$\hat{y}_i$ is the predicted response (or fitted value) for experimental unit i
Then, the equation for the best fitting line is:
$$\hat{y}_i = b_0 + b_1 x_i$$
Incidentally, recall that an "experimental unit" is the object or person on which the measurement is
made. In our height and weight example, the experimental units are students.
Let's try out the notation on our example with the trend summarized by the line
w = -266.53 + 6.1376 h. (Note that this line is just a more precise version of the above solid line,
w = -266.5 + 6.1 h.) The first data point in the list indicates that student 1 is 63 inches tall and
weighs 127 pounds. That is, $x_1 = 63$ and $y_1 = 127$. Do you see this point on the plot? If we know
this student's height but not his or her weight, we could use the equation of the line to predict his
or her weight. We'd predict the student's weight to be -266.53 + 6.1376(63) or 120.1 pounds. That is,
$\hat{y}_1 = 120.1$. Clearly, our prediction wouldn't be perfectly correct; it has some "prediction
error" (or "residual error"). In fact, the size of its prediction error is 127 - 120.1 or 6.9 pounds.
You might want to work through each of the 10 data points to make sure you understand the
notation used to keep track of the predictor values, the observed responses and the predicted
responses:
  i    $x_i$    $y_i$    $\hat{y}_i$
  1    63       127      120.1
  2    64       121      126.3
  3    66       142      138.5
  4    69       157      157.0
  5    69       162      157.0
  6    71       156      169.2
  7    71       169      169.2
  8    72       165      175.4
  9    73       181      181.5
 10    75       208      193.8
As you can see, the size of the prediction error depends on the data point. If we didn't know the
weight of student 5, the equation of the line would predict his or her weight to be
-266.53 + 6.1376(69) or 157 pounds. The size of the prediction error here is 162 - 157, or 5 pounds.
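The fitted values and prediction errors in the table can be reproduced with a few lines of Python. This is a minimal sketch: the data are the 10 heights and weights listed above, and the names `predict`, `fitted` and `residuals` are illustrative choices, not part of the course material.

```python
# Heights (inches) and weights (pounds) of the 10 students in the table above.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

# Coefficients of the solid line, w = -266.53 + 6.1376 h.
b0, b1 = -266.53, 6.1376


def predict(x):
    """Return the fitted value b0 + b1 * x for a height x (hypothetical helper)."""
    return b0 + b1 * x


# Fitted value and prediction error (residual) for every student.
fitted = [predict(x) for x in heights]
residuals = [y - y_hat for y, y_hat in zip(weights, fitted)]

for i, (x, y, y_hat, e) in enumerate(zip(heights, weights, fitted, residuals), start=1):
    print(f"{i:2d}  x={x}  y={y}  y_hat={y_hat:6.1f}  e={e:6.1f}")
```

Rounded to one decimal place, the first row gives a fitted value of 120.1 and a prediction error of 6.9, and the fifth row gives a prediction error of 5.0, matching the worked examples in the text.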
In general, when we use $\hat{y}_i = b_0 + b_1 x_i$ to predict the actual response $y_i$, we make a
prediction error (or residual error) of size:
$$e_i = y_i - \hat{y}_i$$
A line that fits the data "best" will be one for which the n prediction errors, one for each observed
data point, are as small as possible in some overall sense. One way to achieve this goal is to
invoke the "least squares criterion," which says to "minimize the sum of the squared prediction
errors." That is:
The equation of the best fitting line is: $\hat{y}_i = b_0 + b_1 x_i$
We just need to find the values $b_0$ and $b_1$ that make the sum of the squared prediction errors
the smallest it can be.
That is, we need to find the values $b_0$ and $b_1$ that minimize:
$$Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Here's how you might think about this quantity Q:
The quantity $e_i = y_i - \hat{y}_i$ is the prediction error for data point i.
The quantity $e_i^2 = (y_i - \hat{y}_i)^2$ is the squared prediction error for data point i.
And, the symbol $\sum_{i=1}^{n}$ tells us to add up the squared prediction errors for all n data points.
Incidentally, if we didn't square the prediction error $e_i = y_i - \hat{y}_i$ to get
$e_i^2 = (y_i - \hat{y}_i)^2$, the positive and negative prediction errors would cancel each other
out when summed, always yielding 0 for the least squares line.
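To make the role of squaring concrete, the sketch below computes both the plain sum of the prediction errors and the quantity Q for the line w = -266.53 + 6.1376 h. The function names are hypothetical. Because the coefficients are rounded, the plain sum comes out close to zero rather than exactly zero.

```python
# Heights (inches) and weights (pounds) of the 10 students.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]


def prediction_errors(b0, b1, xs, ys):
    """Return the list of e_i = y_i - (b0 + b1 * x_i)."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sum_squared_errors(b0, b1, xs, ys):
    """Return Q, the sum of the squared prediction errors."""
    return sum(e ** 2 for e in prediction_errors(b0, b1, xs, ys))


errors = prediction_errors(-266.53, 6.1376, heights, weights)

# Positive and negative errors nearly cancel, so this sum says little about fit.
raw_sum = sum(errors)

# Squaring removes the signs, so every error contributes to Q.
q = sum_squared_errors(-266.53, 6.1376, heights, weights)

print(f"sum of prediction errors: {raw_sum:.3f}")
print(f"Q (sum of squared errors): {q:.1f}")
```

The plain sum is a small fraction of a pound even though individual errors exceed 14 pounds, which is why it cannot serve as a measure of fit.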
Now, being familiar with the least squares criterion, let's take a fresh look at our plot. In light of
the least squares criterion, which line do you now think is the best fitting line?
Let's see how you did! The following table illustrates the implementation of the least
squares criterion for the two lines up for consideration, the dashed line and the solid line.
                           w = -331.2 + 7.1 h          w = -266.53 + 6.1376 h
                           (the dashed line)           (the solid line)
  i    $x_i$    $y_i$      $(y_i - \hat{y}_i)^2$       $(y_i - \hat{y}_i)^2$
  1    63       127        118.81                      47.076
  2    64       121          4.84                      27.840
  3    66       142         21.16                      11.891
  4    69       157          2.89                       0.001
  5    69       162         10.89                      25.357
  6    71       156        285.61                     175.287
  7    71       169         15.21                       0.057
  8    72       165        225.00                     107.686
  9    73       181         37.21                       0.265
 10    75       208         44.89                     201.924
                           ______                      ______
                           766.5                       597.4
Based on the least squares criterion, which equation best summarizes the data? The sum of the
squared prediction errors is 766.5 for the dashed line, while it is only 597.4 for the solid line.
Therefore, of the two lines, the solid line, w = -266.53 + 6.1376 h, best summarizes the data. But, is
this equation guaranteed to be the best fitting line of all of the possible lines we didn't even consider?
Of course not!
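The comparison of the two candidate lines can be checked with a short Python sketch. The function name `sse` is an illustrative choice; the data and the two sets of coefficients are the ones given in the text.

```python
# Heights (inches) and weights (pounds) of the 10 students.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]


def sse(b0, b1):
    """Sum of squared prediction errors for the line w = b0 + b1 * h."""
    return sum((w - (b0 + b1 * h)) ** 2 for h, w in zip(heights, weights))


dashed = sse(-331.2, 7.1)       # the dashed line
solid = sse(-266.53, 6.1376)    # the solid line

print(f"dashed line: {dashed:.1f}")
print(f"solid line:  {solid:.1f}")
print("better fit:", "solid" if solid < dashed else "dashed")
```

This only ranks the two lines that were supplied. It says nothing about the lines that were never tried, which is the point of the question above.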
If we used the above approach for finding the equation of the line that minimizes the sum of the
squared prediction errors, we'd have our work cut out for us. We'd have to implement the above
procedure for an infinite number of possible lines, clearly an impossible task! Fortunately,
somebody has done some dirty work for us by figuring out formulas for the intercept $b_0$ and the
slope $b_1$ for the equation of the line that minimizes the sum of the squared prediction errors.
The formulas are determined using methods of calculus. We minimize the equation for the sum of the
squared prediction errors:
$$Q = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2$$
(that is, take the derivative with respect to $b_0$ and $b_1$, set to 0, and solve for $b_0$ and
$b_1$) and get the "least squares estimates" for $b_0$ and $b_1$:
$$b_0 = \bar{y} - b_1 \bar{x}$$
and:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
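For readers who want to see the intermediate step, the calculus described above can be sketched as follows. This is a standard derivation written out under the same notation, not a reproduction of the course's own working.

```latex
% Set both partial derivatives of Q to zero (the "normal equations").
\frac{\partial Q}{\partial b_0} = -2 \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr) = 0
\qquad
\frac{\partial Q}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - b_0 - b_1 x_i\bigr) = 0

% Dividing the first equation by -2n gives the intercept.
b_0 = \bar{y} - b_1 \bar{x}

% Substituting b_0 into the second equation and solving for b_1.
\sum_{i=1}^{n} x_i \bigl( (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \bigr) = 0
\quad\Longrightarrow\quad
b_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
    = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

The last equality holds because $\sum_{i=1}^{n} \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} \bar{x}(x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ inside each sum changes nothing.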
Because the formulas for $b_0$ and $b_1$ are derived using the least squares criterion, the resulting
equation $\hat{y}_i = b_0 + b_1 x_i$ is often referred to as the "least squares regression line," or
simply the "least squares line." It is also sometimes called the "estimated regression equation."
Incidentally, note that in deriving the above formulas, we made no assumptions about the data other
than that they follow some sort of linear trend.
We can see from these formulas that the least squares line passes through the point
$(\bar{x}, \bar{y})$, since when $x = \bar{x}$, then
$y = b_0 + b_1 \bar{x} = \bar{y} - b_1 \bar{x} + b_1 \bar{x} = \bar{y}$.
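Both the formulas and the property that the line passes through the point of means can be checked numerically on the height and weight data. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from statistics import mean

# Heights (inches) and weights (pounds) of the 10 students.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

x_bar = mean(heights)
y_bar = mean(weights)

# Numerator and denominator of the slope formula.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
sxx = sum((x - x_bar) ** 2 for x in heights)

b1 = sxy / sxx             # least squares slope
b0 = y_bar - b1 * x_bar    # least squares intercept

# Evaluating the fitted line at x_bar should return y_bar.
at_mean = b0 + b1 * x_bar

print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")
print(f"line at x_bar: {at_mean:.1f}, y_bar: {y_bar:.1f}")
```

The printed coefficients agree with the solid line, w = -266.53 + 6.1376 h, to the precision quoted in the text.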
In practice, you won't really need to worry about the formulas for $b_0$ and $b_1$. Instead, you are
going to let statistical software, such as Minitab, find least squares lines for you. But, we can still
learn something from the formulas, for $b_1$ in particular.
If you study the formula for the slope $b_1$:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
you see that the denominator is necessarily positive since it only involves summing squared terms
(provided the x values are not all identical). Therefore, the sign of the slope $b_1$ is solely
determined by the numerator. The numerator tells us, for each data point, to sum up the product of
two signed distances: the distance of the x-value from the mean of all of the x-values and the
distance of the y-value from the mean of all of the y-values. Let's see how this determines the sign
of the slope $b_1$ by studying the following two plots.
When is the slope $b_1$ > 0? Do you agree that the trend in the following plot is positive, that is,
as x increases, y tends to increase? If the trend is positive, then the slope $b_1$ must be positive.
Let's see how!
Consider a data point in the upper right quadrant, where the quadrants are formed by the lines
$x = \bar{x}$ and $y = \bar{y}$. Note that the product of the two distances for this data point is
positive, since both distances are positive. In fact, the product of the two distances is positive
for any data point in the upper right quadrant.
Now consider a data point in the lower left quadrant. Note that the product of the two distances for
this data point is also positive, since both distances are negative. In fact, the product of the two
distances is positive for any data point in the lower left quadrant.
Adding up all of these positive products must necessarily yield a positive number, and hence the
slope of the line $b_1$ will be positive.
When is the slope $b_1$ < 0? Now, do you agree that the trend in the following plot is negative, that
is, as x increases, y tends to decrease? If the trend is negative, then the slope $b_1$ must be
negative. Let's see how!
Consider a data point in the upper left quadrant. Note that the product of the two distances for
this data point is negative, since the x distance is negative and the y distance is positive. In
fact, the product of the two distances is negative for any data point in the upper left quadrant.
Now consider a data point in the lower right quadrant. Note that the product of the two distances
for this data point is also negative, since the x distance is positive and the y distance is
negative. In fact, the product of the two distances is negative for any data point in the lower
right quadrant.
Adding up all of these negative products must necessarily yield a negative number, and hence the
slope of the line $b_1$ will be negative.
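The quadrant argument can be tried on the height and weight data. In this sketch (variable names are illustrative), each product is one term of the numerator of $b_1$. Real data rarely sit entirely in two quadrants, so a few products may carry the opposite sign; it is the sign of the total that determines the sign of the slope.

```python
from statistics import mean

# Heights (inches) and weights (pounds) of the 10 students.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

x_bar = mean(heights)
y_bar = mean(weights)

# One product of signed distances per data point.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)]

# Points in the upper left or lower right quadrant give negative products.
negatives = [p for p in products if p < 0]

numerator = sum(products)

for x, y, p in zip(heights, weights, products):
    print(f"x={x}  y={y}  product={p:8.2f}")
print(f"numerator of b1: {numerator:.1f}")
```

Here two of the ten products are negative, but the positive products dominate, so the numerator and therefore the slope are positive.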
Now that we finished that investigation, you can just set aside the formulas for $b_0$ and $b_1$.
Again, in practice, you are going to let statistical software, such as Minitab, find least squares
lines for you. We can obtain the estimated regression equation in two different places in Minitab.
The following plot illustrates where you can find the least squares line (boxed) on Minitab's
"fitted line plot."
The following Minitab output illustrates where you can find the least squares line (boxed) in
Minitab's "standard regression analysis" output.
Note that the estimated values $b_0$ and $b_1$ also appear in a table under the columns labeled
"Predictor" (the intercept $b_0$ is always referred to as the "Constant" in Minitab) and "Coef" (for
"Coefficients"). Also, note that the value we obtained by minimizing the sum of the squared
prediction errors, 597.4, appears in the "Analysis of Variance" table appropriately in a row labeled
"Residual Error" and under a column labeled "SS" (for "Sum of Squares").
Although we've learned how to obtain the "estimated regression coefficients" $b_0$ and $b_1$, we've
not yet discussed what we learn from them. One thing they allow us to do is to predict future
responses, one of the most common uses of an estimated regression line. This use is rather
straightforward:
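A common use of the estimated regression line:
$$\hat{y}_{i,wt} = -267 + 6.14 x_{i,ht}$$
Predict (mean) weight of 66-inch-tall people:
$$\hat{y}_{i,wt} = -267 + 6.14(66) = 138.24$$
Predict (mean) weight of 67-inch-tall people:
$$\hat{y}_{i,wt} = -267 + 6.14(67) = 144.38$$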
Now, what does $b_0$ tell us? The answer is obvious when you evaluate the estimated regression
equation at x = 0. Here, it tells us that a person who is 0 inches tall is predicted to weigh -267
pounds! Clearly, this prediction is nonsense. This happened because we "extrapolated" beyond the
"scope of the model" (the range of the x values). It is not meaningful to have a height of 0 inches,
that is, the scope of the model does not include x = 0. So, here the intercept $b_0$ is not
meaningful. In general, if the "scope of the model" includes x = 0, then $b_0$ is the predicted mean
response when x = 0. Otherwise, $b_0$ is not meaningful.
And, what does $b_1$ tell us? The answer is obvious when you subtract the predicted weight of
66-inch-tall people from the predicted weight of 67-inch-tall people. We obtain
144.38 - 138.24 = 6.14 pounds, the value of $b_1$. Here, it tells us that we predict the mean weight
to increase by 6.14 pounds for every additional one-inch increase in height. In general, we can
expect the mean response to increase or decrease by $b_1$ units for every one-unit increase in x.
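The two interpretations can be illustrated with the rounded equation, predicted weight = -267 + 6.14 (height). In this sketch the helper names and the `in_scope` check are hypothetical; the scope limits are simply the smallest and largest heights in the sample.

```python
# Rounded estimated regression coefficients from the text.
b0, b1 = -267, 6.14

# Range of heights (inches) observed in the sample: the scope of the model.
MIN_HEIGHT, MAX_HEIGHT = 63, 75


def predict_mean_weight(height):
    """Predicted mean weight (pounds) at the given height (inches)."""
    return b0 + b1 * height


def in_scope(height):
    """True if the height lies within the range of the observed x values."""
    return MIN_HEIGHT <= height <= MAX_HEIGHT


w66 = predict_mean_weight(66)
w67 = predict_mean_weight(67)

# The difference between predictions one inch apart is the slope b1.
diff = w67 - w66

# Evaluating at x = 0 returns the intercept b0, an extrapolation here.
at_zero = predict_mean_weight(0)

print(f"66 inches: {w66:.2f}   67 inches: {w67:.2f}   difference: {diff:.2f}")
print(f"0 inches: {at_zero:.0f}   within scope: {in_scope(0)}")
```

Checking `in_scope` before predicting is one simple way to flag extrapolation such as the 0-inch case, where the formula still returns a number but the number is not meaningful.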
Source URL: https://onlinecourses.science.psu.edu/stat501/node/252
Links:
[1] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/student_height_weight.txt