12.8 Prediction
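Recall the third exam/final exam example.
We examined the scatterplot and showed that the correlation coefficient is significant. We found the equation of the best fit line, so we can now use the least squares regression line for prediction.
Suppose you want to predict the final exam score of a student who scored 73 on the third exam. The exam scores (x-values) range from 65 to 75. Since 73 is between the x-values 65 and 75, substitute x = 73 into the equation. Then:
$$\hat{y} = -173.51 + 4.83(73) = 179.08 \tag{12.8}$$
We predict that a student who scores 73 on the third exam will score 179.08 on the final exam, on average.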
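The substitution above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original example: the function name `predict_final` and the choice to raise an error outside the observed range of 65 to 75 are assumptions made here to mirror the remark that 73 lies between the observed x-values.

```python
# Coefficients of the least squares line from the third exam/final exam example.
INTERCEPT = -173.51
SLOPE = 4.83
X_MIN, X_MAX = 65, 75  # range of the observed third exam scores


def predict_final(third_exam):
    """Predict the final exam score (hypothetical helper); refuse to extrapolate."""
    if not X_MIN <= third_exam <= X_MAX:
        raise ValueError(
            f"{third_exam} is outside the observed range {X_MIN} to {X_MAX}"
        )
    return INTERCEPT + SLOPE * third_exam


print(round(predict_final(73), 2))  # 179.08
print(round(predict_final(66), 2))  # 145.27
```

Raising an error is only one way to handle an x-value outside the data; a warning would serve the same purpose.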
Example 12.11
Recall the third exam/final exam example.
Problem 1
What would you predict the final exam score to be for a student who scored a 66 on the third exam?
Solution
145.27
Problem 2 (Solution on p. 579.)
What would you predict the final exam score to be for a student who scored a 90 on the third exam?
12.9 Outliers
In some data sets, there are values (observed data points) called outliers. Outliers are observed data points that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical distance from the line to the point.
Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the analysis of the data. It is possible that an outlier is a result of erroneous data. Other times, an outlier is a correct observation and should remain in the data.
Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly.
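The remove-and-refit check for influential points can be sketched as follows. This is a minimal illustration under assumptions: it reuses the third exam/final exam data from Table 12.1, the helper name `slope` is hypothetical, and deciding how large a change counts as "significant" is left to the reader.

```python
def slope(points):
    """Least squares slope b = Sxy / Sxx for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Third exam (x) and final exam (y) scores from Table 12.1.
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

full_slope = slope(data)

# Change in slope when each point is left out in turn.
changes = [(slope(data[:i] + data[i + 1:]) - full_slope, point)
           for i, point in enumerate(data)]

# List the points from largest to smallest effect on the slope.
for change, point in sorted(changes, key=lambda c: abs(c[0]), reverse=True):
    print(point, round(change, 2))
```

Points near the top of the printed list are the ones worth examining first.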
CHAPTER 12. LINEAR REGRESSION AND CORRELATION
Computers and many calculators can be used to identify outliers from the data. Computer output for regression analysis will often identify both outliers and influential points so that you can examine them.
Identifying Outliers
We could guess at outliers by looking at a graph of the scatterplot and best fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule of thumb, we can flag any point that is located further than two standard deviations above or below the best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.
We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the standard deviation. On the TI-83, 83+, or 84+, the graphical approach is easier. The graphical procedure is shown first, followed by the numerical calculations. You would generally only need to use one of these methods.
Example 12.12
In the third exam/final exam example, you can determine if there is an outlier or not. If there is an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This means the SSE should be smaller and the correlation coefficient ought to be closer to 1 or -1.
Solution
Graphical Identification of Outliers
With the TI-83, 83+, 84+ graphing calculators, it is easy to identify the outlier graphically and visually. If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance was equal to 2s or farther, then we would consider the data point to be "too far" from the line of best fit. We need to find and graph the lines that are two standard deviations below and above the regression line. Any points that are outside these two lines are outliers. We will call these lines Y2 and Y3:
As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us. Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412.
Line Y2 = -173.5 + 4.83x - 2(16.4) and line Y3 = -173.5 + 4.83x + 2(16.4),
where $\hat{y} = -173.5 + 4.83x$ is the line of best fit. Y2 and Y3 have the same slope as the line of best fit.
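Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the "Y=" equation editor and press ZOOM 9. You will find that the only data point that is not between lines Y2 and Y3 is the point x = 65, y = 175. On the calculator screen it is just barely outside these lines. The outlier is the student who scored 65 on the third exam and 175 on the final exam; this point is further than 2 standard deviations away from the best fit line.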
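For readers without a graphing calculator, the comparison against Y2 and Y3 can be sketched in Python. This is an illustrative sketch: the function names `y1`, `y2`, and `y3` simply mirror the calculator's equation slots, and the rounded values -173.5, 4.83, and 16.4 are taken from the text above.

```python
S = 16.4  # standard deviation of the residuals, rounded as in the text


def y1(x):
    """Line of best fit."""
    return -173.5 + 4.83 * x


def y2(x):
    """Lower line: two standard deviations below the line of best fit."""
    return y1(x) - 2 * S


def y3(x):
    """Upper line: two standard deviations above the line of best fit."""
    return y1(x) + 2 * S


# Third exam (x) and final exam (y) scores from Table 12.1.
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

# Keep only the points that are not between the two extra lines.
outside = [(x, y) for x, y in data if not y2(x) <= y <= y3(x)]
print(outside)  # [(65, 175)]
```

At x = 65 the upper line is at 173.25, so the observed score of 175 is only slightly above it, which is why the point looks barely outside on a small screen.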
Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to tell if the point is between or outside the lines. On a computer, enlarging the graph may help; on a small calculator screen, zooming in may make the graph clearer. Note that when the graph does not give a clear enough picture, you can use the numerical comparisons to identify outliers.
Figure 12.16
Numerical Identification of Outliers
In the table below, the first two columns are the third exam and final exam data. The third column shows the predicted $\hat{y}$ values calculated from the line of best fit: $\hat{y} = -173.5 + 4.83x$. The residuals, or errors, have been calculated in the fourth column of the table: observed y value - predicted y value $= y - \hat{y}$.
$s$ is the standard deviation of all the $y - \hat{y} = \varepsilon$ values, where $n$ = the total number of data points. If each residual is squared and the squares are added, the result is the SSE. The standard deviation of the residuals is calculated from the SSE as:
$$s = \sqrt{\frac{SSE}{n-2}}$$
Rather than calculate the value of s ourselves, we can find s using the computer or calculator. For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals 35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1.
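| x | y | $\hat{y}$ | $y - \hat{y}$ |
|---|---|---|---|
| 65 | 175 | 140 | 175 - 140 = 35 |
| 67 | 133 | 150 | 133 - 150 = -17 |
| 71 | 185 | 169 | 185 - 169 = 16 |
| 71 | 163 | 169 | 163 - 169 = -6 |
| 66 | 126 | 145 | 126 - 145 = -19 |
| 75 | 198 | 189 | 198 - 189 = 9 |
| 67 | 153 | 150 | 153 - 150 = 3 |
| 70 | 163 | 164 | 163 - 164 = -1 |
| 71 | 159 | 169 | 159 - 169 = -10 |
| 69 | 151 | 160 | 151 - 160 = -9 |
| 69 | 159 | 160 | 159 - 160 = -1 |

Table 12.1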
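We are looking for all data points for which the residual is greater than 2s = 2(16.4) = 32.8 or less than -32.8. Compare these values to the residuals in column 4 of the table. The only such data point is the student who scored 65 on the third exam and 175 on the final exam; the residual for this student is 35.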
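The same numerical check can be carried out from the raw data rather than from the rounded table. The sketch below is an illustration under assumptions: it fits the least squares line itself using the standard formulas, so the residuals are unrounded and s comes out near the calculator's 16.412 rather than the rounded 16.4.

```python
from math import sqrt

# Third exam (x) and final exam (y) scores from Table 12.1.
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Least squares slope b and intercept a.
b = (sum((x - mean_x) * (y - mean_y) for x, y in data)
     / sum((x - mean_x) ** 2 for x, _ in data))
a = mean_y - b * mean_x

# Residuals y - y-hat, their sum of squares, and their standard deviation.
residuals = [y - (a + b * x) for x, y in data]
sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - 2))

# Flag every point whose residual is more than 2s from the line.
flagged = [point for point, e in zip(data, residuals) if abs(e) > 2 * s]
print(round(s, 3), flagged)  # 16.412 [(65, 175)]
```

The unrounded residual for (65, 175) is a little under 35, but it still exceeds 2s, so the conclusion matches the table.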
How does the outlier affect the best fit line?
Numerically and graphically, we have identified the point (65, 175) as an outlier. We should re-examine the data for this point to see if there are any problems with the data. If there is an error, we should fix the error if possible, or delete the data. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier data was an error. Therefore we will continue on and delete the outlier, so that we can explore how it affects the results, as a learning experience.
Compute a new best-fit line and correlation coefficient using the 10 remaining points:
On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new line of best fit and the correlation coefficient are:
$\hat{y} = -355.19 + 7.39x$ and $r = 0.9121$
The new line with r = 0.9121 is a stronger correlation than the original (r = 0.6631) because r = 0.9121 is closer to 1. This means that the new line is a better fit to the 10 remaining data values. The line can better predict the final exam score given the third exam score.
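The refit can also be sketched without a calculator. This is a minimal illustration, assuming the standard least squares and correlation formulas; the helper name `fit` is hypothetical and is not a calculator or library function.

```python
from math import sqrt


def fit(points):
    """Return (intercept, slope, r) for the least squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / sqrt(sxx * syy)


# Third exam (x) and final exam (y) scores from Table 12.1.
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]
remaining = [p for p in data if p != (65, 175)]  # outlier deleted

for label, points in (("all 11 points", data), ("outlier removed", remaining)):
    a, b, r = fit(points)
    print(f"{label}: y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}")
```

Up to rounding, the two printed lines should reproduce the original and new equations and the correlation coefficients quoted above.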
Numerical Identification of Outliers: Calculating s and Finding Outliers Manually
If you do not have the function LinRegTTest, then you can calculate the outlier in the first example by doing the following.
First, square each $|y - \hat{y}|$ (see the table above).
The squares are $35^2, 17^2, 16^2, 6^2, 19^2, 9^2, 3^2, 1^2, 10^2, 9^2, 1^2$.
Then, add (sum) all the $|y - \hat{y}|$ squared terms using the formula
$$\sum_{i=1}^{11} \left(|y_i - \hat{y}_i|\right)^2 = \sum_{i=1}^{11} \varepsilon_i^2 \qquad (\text{recall that } y_i - \hat{y}_i = \varepsilon_i)$$
$$= 35^2 + 17^2 + 16^2 + 6^2 + 19^2 + 9^2 + 3^2 + 1^2 + 10^2 + 9^2 + 1^2 = 2440 = SSE$$
The result, SSE, is the Sum of Squared Errors.
Next, calculate $s$, the standard deviation of all the $y - \hat{y} = \varepsilon$ values, where $n$ = the total number of data points.
The calculation is $s = \sqrt{\frac{SSE}{n-2}}$.
For the third exam/final exam problem, $s = \sqrt{\frac{2440}{11-2}} = 16.47$.
Next, multiply s by 1.9:
(1.9)(16.47) = 31.29
31.29 is almost 2 standard deviations away from the mean of the $y - \hat{y}$ values.
If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance is at least 1.9s, then we would consider the data point to be "too far" from the line of best fit. We call that point a potential outlier.
For the example, if any of the $|y - \hat{y}|$ values are at least 31.29, the corresponding (x, y) data point is a potential outlier.
For the third exam/final exam problem, all the $|y - \hat{y}|$ values are less than 31.29 except for the first one, which is 35.
35 > 31.29
That is, $|y - \hat{y}| \geq (1.9)(s)$.
The point which corresponds to $|y - \hat{y}| = 35$ is (65, 175). Therefore, the data point (65, 175) is a potential outlier. For this example, we will delete it. (Remember, we do not always delete an outlier.)
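The manual steps above can be checked with a short script. This sketch assumes the rounded residuals from the fourth column of Table 12.1 and rounds s and the cutoff to two decimal places, as the hand calculation does; the variable names are chosen here for illustration.

```python
from math import sqrt

# Rounded residuals y - y-hat from the fourth column of Table 12.1.
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]

sse = sum(e ** 2 for e in residuals)   # sum of squared errors
n = len(residuals)
s = round(sqrt(sse / (n - 2)), 2)      # standard deviation of the residuals
cutoff = round(1.9 * s, 2)             # "too far" threshold used in the text

# Row positions (0-based) whose absolute residual reaches the cutoff.
flagged = [i for i, e in enumerate(residuals) if abs(e) >= cutoff]
print(sse, s, cutoff, flagged)  # 2440 16.47 31.29 [0]
```

Row 0 of the table is the point (65, 175), in agreement with the hand calculation.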
The next step is to compute a new best-fit line using the 10 remaining points. The new line of best fit and the correlation coefficient are:
$\hat{y} = -355.19 + 7.39x$ and $r = 0.9121$
Example 12.13
Using this new line of best fit (based on the remaining 10 data points), what would a student who scores 73 on the third exam expect to score on the final exam?
Solution
Using the new line of best fit, $\hat{y} = -355.19 + 7.39(73) = 184.28$. A student who scored 73 points on the third exam would expect to earn 184 points on the final exam.
The original line predicted $\hat{y} = -173.51 + 4.83(73) = 179.08$, so the prediction using the new line with the outlier eliminated differs from the original prediction.
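To see how much the two lines disagree, the predictions can be compared at several third exam scores. This is an illustrative sketch: the function names `original_line` and `new_line` and the choice of x-values are assumptions made here, while the coefficients are the rounded ones from the text.

```python
def original_line(x):
    """Best fit line using all 11 points."""
    return -173.51 + 4.83 * x


def new_line(x):
    """Best fit line after deleting the outlier (65, 175)."""
    return -355.19 + 7.39 * x


# Compare the two predictions across the observed range of third exam scores.
for x in (65, 69, 73, 75):
    old, new = original_line(x), new_line(x)
    print(x, round(old, 2), round(new, 2), round(new - old, 2))
```

For x = 73 the difference is 184.28 - 179.08 = 5.20 points. Because the new line is steeper, the gap between the two predictions changes as x moves across the range.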
Example 12.14
(From The Consumer Price Indexes Web site) The Consumer Price Index (CPI) measures the average change over time in the prices paid by urban consumers for consumer goods and services. The CPI affects nearly all Americans because of the many ways it is used. One of its biggest uses is as a measure of inflation. Policymakers, including Congress and the Federal Reserve Board, use the CPI's trends to formulate monetary and fiscal policies. In the following table, x is the year and y is the CPI.
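Data:

| x (year) | y (CPI) |
|---|---|
| 1915 | 10.1 |
| 1926 | 17.7 |
| 1935 | 13.7 |
| 1940 | 14.7 |
| 1947 | 24.1 |
| 1952 | 26.5 |
| 1964 | 31.0 |
| 1969 | 36.7 |
| 1975 | 49.3 |
| 1979 | 72.6 |
| 1980 | 82.4 |
| 1986 | 109.6 |
| 1991 | 130.7 |
| 1999 | 166.6 |

Table 12.2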
Problem
• Make a scatterplot of the data.
• Calculate the least squares line. Write the equation in the form $\hat{y} = a + bx$.
• Draw the line on the scatterplot.
• Find the correlation coefficient. Is it significant?
• What is the average CPI for the year 1990?
