Stat
Population
Probability
Sample
If we know #:
Stats
Sampling bias?
Sample
Population includes
interest. Data is
a subset of the
with an n.
Other bias?
Data
Data
collection.
Sample
Stat inference.
1.3
Association: Two variables are as if values of one tend to be related
to values of other in regular way.
Smoking and lung cancer
Road salt and accidents.
Two variable association: Are casually associated changing value of
one influences another. Cause and effect.
Colder temperature and higher heating cost.
Association
Twovariablesareassociatedifvaluesofonevariabletendtoberelatedtothevaluesof
theothervariable.
Causation
Twovariablesarecausallyassociatedifchangingthevalueofonevariableinfluences
thevalueoftheothervariable.
Fix this!!!!
NotationforaProportion
Theproportionforasampleisdenoted andreadphat.
Theproportionforapopulationisdenotedp.
TwoWayTable
Atwowaytableisusedtoshowtherelationshipbetweentwocategoricalvariables.The
categoriesforonevariablearelisteddowntheside(rows)andthecategoriesforthe
secondvariablearelistedacrossthetop(columns).Eachcellofthetablecontainsthe
countofthenumberofdatacasesthatareinboththerowandcolumncategories.
Outliers
Anoutlierisanobservedvaluethatisnotablydistinctfromtheothervaluesinadataset.
Usually,anoutlierismuchlargerormuchsmallerthantherestofthedatavalues.
CommonShapesforDistributions
Adistributionshowninahistogramordotplotiscalled:
Symmetricifthetwosidesapproximatelymatchwhenfoldedonaverticalcenterline
Skewedtotherightifthedataarepiledupontheleftandthetailextendsrelativelyfar
outtotheright
Skewedtotheleftifthedataarepiledupontherightandthetailextendsrelativelyfar
outtotheleft
Bellshapedifthedataaresymmetricand,inaddition,havetheshapeshownin
Figure2.9(c)
Ofcourse,manyothershapesarealsopossible.
Mean
Themeanofthedatavaluesforasinglequantitativevariableisgivenby
Median
Themedianofasetofdatavaluesforasinglequantitativevariable,denotedm,is
themiddleentryifanorderedlistofthedatavaluescontainsanoddnumberofentries,or
theaverageofthemiddletwovaluesifanorderedlistcontainsanevennumberof
entries.
Themediansplitsthedatainhalf.23
Resistance
Ingeneral,wesaythatastatisticisresistantifitisrelativelyunaffectedbyextreme
values.Themedianisresistant,whilethemeanisnot.
DefinitionofStandardDeviation
Thestandarddeviationforaquantitativevariablemeasuresthespreadofthedataina
sample:
Thestandarddeviationgivesaroughestimateofthetypicaldistanceofadatavaluefrom
themean.Thelargerthestandarddeviation,themorevariabilitythereisinthedataand
themorespreadoutthedataare.
NotationfortheStandardDeviation
Thestandarddeviationofasampleisdenoteds,andmeasureshowspreadoutthedata
arefromthesamplemean .
Thestandarddeviationofapopulation37isdenoted,whichistheGreeklettersigma,
andmeasureshowspreadoutthedataarefromthepopulationmean.
NumberofStandardDeviationsfromtheMean:zScores
Thezscoreforadatavalue,x,fromasamplewithmean andstandarddeviationsis
definedtobe
Forapopulation, isreplacedwithandsisreplacedwith.
Thezscoretellshowmanystandarddeviationsthevalueisfromthemeanandis
independentoftheunitofmeasurement.
Percentiles
ThePthpercentileisthevalueofaquantitativevariablewhichisgreaterthanPpercent
ofthedata.39
FiveNumberSummary
Wedefine
where
Thefivenumbersummarydividesthedatasetintofourths:about25%ofthedatafall
betweenanytwoconsecutivenumbersinthefivenumbersummary.
RangeandInterquartileRange
Fromthefivenumbersummary,wecancomputethefollowingtwostatistics:
DetectionofOutliers
Asageneralruleofthumb,wecalladatavalueanoutlierifitis
SmallerthanQ11.5(IQR)orLargerthanQ3+1.5(IQR)
Correlation
Thecorrelationisameasureofthestrengthanddirectionoflinearassociationbetween
twoquantitativevariables.
NotationfortheCorrelation
Thecorrelationbetweentwoquantitativevariablesofasampleisdenotedr.
Thecorrelationbetweentwoquantitativevariablesofapopulationisdenoted,whichis
theGreekletterrho.
TheregressionlinetopredictyfromxisNOTthesameastheregressionlineto
predictxfromy.Besuretoalwayspayattentiontowhichistheexplanatoryvariableand
whichistheresponsevariable!Aregressionlineisalwaysintheform
Theresidualatadatavalueisthedifferencebetweentheobservedandpredictedvalues
oftheresponsevariable:
Onascatterplot,theresidualrepresentstheverticaldeviationfromthelinetoadata
point.Pointsabovethelinewillhavepositiveresidualsandpointsbelowthelinewill
havenegativeresiduals.Ifthepredictedvaluescloselymatchtheobserveddatavalues,
theresidualswillbesmall.
Fortheregressionline
,
Theslopebrepresentsthepredictedchangeintheresponsevariableygivenaoneunit
increaseintheexplanatoryvariablex.
Theinterceptarepresentsthepredictedvalueoftheresponsevariableyiftheexplanatory
variablexiszero.Theinterpretationmaybenonsensicalsinceitisoftennotreasonable
fortheexplanatoryvariabletobezero.
StatisticalInference
Statisticalinferenceistheprocessofdrawingconclusionsabouttheentirepopulation
basedontheinformationinasample.
ParametersvsStatistics
Aparameterisanumberthatdescribessomeaspectofapopulation.
Astatisticisanumberthatiscomputedfromthedatainasample.
PointEstimate
Weusethestatisticfromasampleasapointestimateforapopulationparameter.
SampleSizeMatters!
Asthesamplesizeincreases,thevariabilityofsamplestatisticstendstodecreaseand
samplestatisticstendtobeclosertothetruevalueofthepopulationparameter.
IntervalEstimateBasedonaMarginofError
Anintervalestimategivesarangeofplausiblevaluesforapopulationparameter.One
commonformofanintervalestimateis
Pointestimatemarginoferror
wherethemarginoferrorisanumberthatreflectstheprecisionofthesamplestatistic
asapointestimateforthisparameter.
ConfidenceInterval
Aconfidenceintervalforaparameterisanintervalcomputedfromsampledatabya
methodthatwillcapturetheparameterforaspecifiedproportionofallsamples.
Thesuccessrate(proportionofallsampleswhoseintervalscontaintheparameter)is
knownastheconfidencelevel.
95%ConfidenceIntervalUsingtheStandardError
IfwecanestimatethestandarderrorSEandifthesamplingdistributionisrelatively
symmetricandbellshaped,a95%confidenceintervalcanbeestimatedusing
Statistic2SE