
4/2/2017 2:PointEstimationandIntervalEstimation

Statistical Inference
sourceref:ebook.html

Metadata
1: Populations, Samples, Estimates And Repeated Sampling
2: Point Estimation and Interval Estimation
3: Results From Probability Theory

http://iimk.ac.in/gsdl/cgibin/library?e=d000000statis00000prompt10401l1en5020about0003100110utfZz800&cl=CL1&d=HASHe00909ac46143070d8f732.3&gt=1 1/40

4: One Continuous Variable
5: Comparing Means of Two Continuous Variables
6: Inference For Count data
7: More On Correlation And Regression
8: Summary

Metadata

Title: Statistical Inference
Author: Surfsata

Publisher: Surfstat

Date: 1999

1: Populations, Samples, Estimates And Repeated Sampling

STATISTICAL INFERENCE
POPULATIONS, SAMPLES, ESTIMATES AND REPEATED SAMPLING
Statistical inference is the use of probability theory to make inferences about a population from sample data.

Suppose we want to estimate the characteristics of a population such as the average weight of all 30-year-old women in Australia, or the percentage of voters in N.S.W. who think the Government is doing a good job to control inflation.

In practice we cannot obtain data from every member of the population. Instead, we obtain data from a sample and use the results to make inferences about the population.

Definitions
Population: the collection of all subjects or objects of interest (not necessarily people).

Sample: a subset of the population used to make inferences about the characteristics of the population.

Population parameter: a numerical characteristic of a population, a fixed and usually unknown quantity.

e.g. the average weight, $\mu$, of all 30-year-old women in Australia,

e.g. the % of voters, p, in N.S.W. who think the Government is doing a good job to control inflation.

Data: values measured or recorded on the sample.

Sample statistic: a numerical characteristic of the sample data such as the mean, proportion or variance. It can be used to provide estimates of the corresponding population parameters.

e.g. the % of voters in an opinion poll in Sydney who think the Government is doing a good job to control inflation.

Different samples give different values for sample statistics. By taking many different samples and calculating a sample statistic for each sample (e.g. the sample mean), you could then draw a histogram of all the sample means. A statistic from a sample or randomised experiment can be regarded as a random variable and the histogram is an approximation to its probability distribution. The term sampling distribution is used to describe this distribution, i.e. how the statistic (regarded as a random variable) varies if random samples are repeatedly taken from the population.
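The repeated-sampling idea can be tried out in a few lines of code. The sketch below is only an illustration: the population of 10,000 weights (mean 65, SD 12) is invented, and the function names `sample_means` and `text_histogram` are hypothetical helpers, not part of the course material.

```python
import random
import statistics


def sample_means(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


def text_histogram(values, bins=8, width=40):
    """Return a crude text histogram (one string per bin) of the values."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / step), bins - 1)] += 1
    peak = max(counts)
    return [f"{lo + i * step:7.2f} | " + "#" * round(width * c / peak)
            for i, c in enumerate(counts)]


# Invented population, for illustration only.
pop_rng = random.Random(1)
population = [pop_rng.gauss(65, 12) for _ in range(10_000)]

means = sample_means(population, n=50, repeats=1000)
print("\n".join(text_histogram(means)))
```

The printed histogram approximates the sampling distribution of the sample mean for samples of size 50 from this made-up population.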

Bias

If the sampling distribution is known then the ability of the sample statistic to estimate the corresponding population parameter can be determined.

In particular, the sampling distribution determines the expected value and variance of the sample statistic. If the expected value of the statistic is equal to the population parameter, the estimator is unbiased. If the variance of the statistic is 'small' and it is also unbiased then an observed statistic is likely to be close to the population parameter.

Bias = distance between the parameter and the expected value of the sample statistic

Sample statistics can then be classified as shown in the following diagrams.

1.

Estimates have low bias because their average is near the population parameter, but have high variability because they are widely spread and a single sample value could be far from the parameter.

2.

Estimates have bias because the expected value is not equal to the parameter.

They also have high variability because they are widely spread out.

3.

In this case the estimates are biased because all of them are systematically higher than the population parameter.

The sample statistics have, however, low variability because they are all close together.

4.

In this case the estimates have both low bias and low variability. Experimental design aims to simultaneously reduce bias and variability by producing a sampling distribution as shown in 4.

In general

sample statistic = population parameter + bias + chance variation

Inferences about the characteristics of a population are based on data from a sample.

If the sample is not representative of the population being studied, the sample statistic may be biased, so you cannot use it to make valid inferences about the population parameter.
To minimise bias the sample should be chosen by random sampling from a list of all individuals in the relevant population. This list is called the sampling frame. It is essential.
For a simple random sample the individuals are chosen in such a way that each individual in the sampling frame has an equal chance of being selected. This may involve using computer-generated random numbers to select the sample.

Example: Health survey conducted in the Hunter Region

Study population: all residents of the lower Hunter Region (Newcastle, Lake Macquarie, Port Stephens, Maitland, Cessnock) aged 25-69 years.

Sampling frame: electoral roll (note: some bias is introduced here: younger persons (<35 years) and migrants are less likely to be on the roll).

Sample selection: sample chosen using computer-generated random numbers so each person on the electoral roll in this age group has a 1 in 100 chance of selection.

Actual sample: those who responded to the request to participate in the study.

Non-respondents may differ from the respondents in many ways (e.g. being less healthy) and this could lead to bias in estimates of the proportion of smokers, average weight, etc.

Example: Heights of women, 25-29

Suppose you measured the heights of a random sample of 100 women aged 25-29 years and calculated

sample mean $\bar{x}$ = 165 cm

sample standard deviation s = 5 cm

What can you conclude about the heights of all women in this population aged 25-29 years?

Supposing any bias is negligibly small, the population mean is approximately 165 cm. But how close is the approximation? How might the estimate have varied if a different random sample had been selected?

In parameter estimation, we assume that the distribution of the variable we are interested in is adequately described by a distribution with one or more (unknown) parameters. We attempt to estimate the population parameter using the sample data. To emphasise the difference between sample and population, the parameters we wish to estimate are called population parameters. Estimates of the population parameters obtained from a sample are called sample statistics (or sample estimates).

2: Point Estimation and Interval Estimation

POINT ESTIMATION AND INTERVAL ESTIMATION
In any estimation problem, we need to obtain both a point estimate and an interval estimate. The point estimate is our best guess of the true value of the parameter, while the interval estimate gives a measure of accuracy of that point estimate by providing an interval that contains plausible values.

Sample Mean

When the variable of interest is quantitative, the sample mean $\bar{x}$ provides a point estimate of the unknown population mean $\mu$. When the variable has a binomial distribution, the sample proportion $\hat{p}$ is a point estimate of the unknown population proportion p.

Confidence Interval
Confidence intervals are frequently used as interval estimates. Articles in the research literature commonly report 95% confidence intervals (95% CI). The 95% CI is calculated in such a way that under repeated sampling it will contain the true population parameter 95% of the time. The figure 95% is a measure of the confidence you have that the interval contains the true population parameter. Confidence intervals may be computed for any level of confidence: 90% or 99% confidence intervals are sometimes used. We will examine this topic in more detail after a brief overview of some useful results from probability theory.

3: Results From Probability Theory

RESULTS FROM PROBABILITY THEORY
Since sampling theory (and hence statistical inference) as used in this course relies on the concept of repeated samples, we consider three results from probability theory:

law of averages
law of large numbers
Central Limit Theorem

Recall Kerrich's coin tossing experiment

In 10,000 tosses of a coin you'd expect the number of heads (#heads) to approximately equal the number of tails

so #heads $\approx \tfrac{1}{2}$ #tosses

Fig. 1 shows that (#heads - $\tfrac{1}{2}$ #tosses) can become large in absolute terms as the number of tosses increases

Fig. 2 shows that in relative terms

(% of heads - 50%) $\to$ 0

as #tosses increases

You can think of this as

#heads = $\tfrac{1}{2}$ #tosses + chance error

where chance error becomes large in absolute terms but small as a % of #tosses as #tosses increases.
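A simulated version of the coin tossing experiment shows both behaviours. This is a sketch using a pseudo-random fair coin, not Kerrich's actual data; the function name `chance_errors` and the checkpoint values are chosen only for illustration.

```python
import random


def chance_errors(checkpoints, seed=0):
    """Toss a simulated fair coin and record the chance error at each checkpoint.

    Returns rows of (tosses, heads, chance error, error as % of tosses).
    """
    rng = random.Random(seed)
    heads = tosses = 0
    rows = []
    for target in checkpoints:
        while tosses < target:
            heads += rng.random() < 0.5  # True counts as 1 head
            tosses += 1
        error = heads - tosses / 2
        rows.append((tosses, heads, error, 100 * error / tosses))
    return rows


rows = chance_errors([10, 100, 1000, 10_000])
for tosses, heads, error, pct in rows:
    print(f"{tosses:6d} tosses  {heads:5d} heads  error {error:7.1f}  ({pct:6.2f}% of tosses)")
```

The absolute chance error typically wanders further from zero as the number of tosses grows, while the error as a percentage of tosses shrinks.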

Figure 1. Kerrich's coin tossing experiment. A plot of the 'chance error' (number of heads minus half the number of tosses) against the number of tosses. As the number of tosses goes up, the size of the chance error tends to go up. The horizontal axis is not drawn to scale.

Figure 2. The chance error expressed as a percentage of the number of tosses. As the number of tosses goes up, this percentage gets smaller. In other words, the chance error gets smaller relative to the number of tosses. The horizontal axis is not drawn to scale.

Law of Averages

The Law of Averages states that an average result for n independent trials converges to a limit as n increases.

The law of averages does not work by compensation. A run of heads is just as likely to be followed by a head as by a tail because the outcomes of successive tosses are independent events.

We can prove this result using results from the binomial distribution.

Let the random variable X be the number of heads in n tosses.

The expected value of the proportion of heads, X/n, is 0.5, with variance $(0.5)^2/n$, which goes to zero as $n \to \infty$.

Law of Large Numbers

If $X_1, X_2, \ldots, X_n$ are independent random variables all with the same probability distribution with expected value $\mu$ and variance $\sigma^2$ then

$\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$

is very likely to become very close to $\mu$ as n becomes very large.

Coin tossing is a simple example.

Thus the law of averages is an informal version of the law of large numbers.

The law of large numbers says that

$E(\bar{X}) = \mu$ and $\mathrm{var}(\bar{X})$ becomes very small as n becomes large.

In fact, using the result var(X + Y) = var(X) + var(Y) if X and Y are independent, then

$\mathrm{var}(\bar{X}) = \frac{1}{n^2}\,\mathrm{var}(X_1 + \cdots + X_n) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$

In particular, if the repeated independent measurements are all Normally distributed, that is

$X_i \sim N(\mu, \sigma^2)$

then their mean also has the Normal distribution

$\bar{X} \sim N(\mu, \sigma^2/n)$

The Central Limit Theorem says that $\bar{X}$ is approximately Normally distributed even if the original measurements were not Normally distributed.

Central Limit Theorem

The distribution of sample means $\bar{X}$ approaches

$N(\mu, \sigma^2/n)$

regardless of the shape of the probability distributions of $X_1, X_2, \ldots$
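The Central Limit Theorem can be checked numerically. The sketch below assumes, purely for illustration, that the individual measurements follow an exponential distribution with mean 1 and standard deviation 1 (a strongly skewed shape); the sample size 40 and the name `simulate_means` are arbitrary choices.

```python
import math
import random
import statistics


def simulate_means(n, repeats, seed=0):
    """Means of `repeats` samples of size n from an exponential(1) distribution."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(repeats)]


n = 40
means = simulate_means(n, repeats=5000)

# Theory: mean of the sample means = mu = 1, SD of the sample means = sigma / sqrt(n).
se = 1.0 / math.sqrt(n)
within = sum(abs(m - 1.0) < 1.96 * se for m in means) / len(means)

print(f"mean of sample means: {statistics.mean(means):.3f} (theory 1.000)")
print(f"SD of sample means:   {statistics.stdev(means):.3f} (theory {se:.3f})")
print(f"share within 1.96 SE: {within:.3f} (Normal theory 0.950)")
```

Even though single exponential observations are far from Normal, the share of sample means within 1.96 standard errors of the population mean should come out near 95%.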

4: One Continuous Variable

ONE CONTINUOUS VARIABLE
Sampling Distribution of the Sample Mean

Suppose we have a sample of n observations of a continuous random variable X (e.g. height, cost, temperature).

Let $\mu = E(X)$ be the population mean, and $\sigma = \sqrt{\mathrm{var}(X)}$ be the population standard deviation of X.

Usually both $\mu$ and $\sigma$ are unknown, and we want primarily to estimate $\mu$.

From the sample we can calculate

$\bar{x}$ = sample mean
s = sample standard deviation

The sample mean $\bar{x}$ is an estimate of $\mu$, but how accurate is it?

We know the approximate sampling distribution of the random variable $\bar{X}$ from the Central Limit Theorem. Suppose many different samples of the same size are obtained by repeatedly sampling from a population:

for each sample $\bar{x}$ is calculated and
a histogram of the $\bar{x}$ values is drawn.

The shape of the histogram depends on the size n of the sample, and approximates the sampling distribution.

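Properties of the Sampling Distribution

i. Sampling distributions of $\bar{X}$ have the same expected value, regardless of sample size n, equal to $\mu$.
ii. Even if the distribution of the X values was not Normal, as n increases the distribution for $\bar{X}$ becomes more like the Normal distribution; this is the Central Limit Theorem.
iii. As n increases, the distribution of $\bar{X}$ becomes narrower, that is, the $\bar{x}$'s cluster more tightly around $\mu$. In fact the variance is inversely proportional to n:

$\mathrm{var}(\bar{X}) = \sigma^2/n$

The square root of this variance, $\sigma/\sqrt{n}$, is called the "standard error" of $\bar{X}$: SE($\bar{X}$) = $\sigma/\sqrt{n}$.

The sampling distribution of $\bar{X}$ is (approx.)

$\bar{X} \sim N(\mu, \sigma^2/n)$

This follows from the previous result on the variance of combinations of random variables, since

$\mathrm{var}(\bar{X}) = \mathrm{var}\left(\frac{X_1 + \cdots + X_n}{n}\right) = \frac{1}{n^2}\,(n\sigma^2) = \frac{\sigma^2}{n}$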

Confidence Limits for the Population Mean when $\sigma$ is known

To calculate probabilities for $\bar{X}$ use

$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

e.g. from the table for Z ~ N(0,1),

P(Z < 1) = .8413

so P(-1 < Z < 1) = .8413 - (1 - .8413) = .6826.

So with probability 0.6826, -1 < Z < 1. Equivalently, with probability .6826,

$-1 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1$.

We want to rearrange these inequalities so that $\mu$ alone is in the middle. If you rearrange a statement that has a certain probability of being true, the new statement still has that same probability.

First, multiply each term by $\sigma/\sqrt{n}$ to obtain

$-\sigma/\sqrt{n} < \bar{X} - \mu < \sigma/\sqrt{n}$

Multiply each term by -1 (this means the inequality signs should also be reversed)

$\sigma/\sqrt{n} > \mu - \bar{X} > -\sigma/\sqrt{n}$

Add $\bar{X}$ to each term

$\bar{X} + \sigma/\sqrt{n} > \mu > \bar{X} - \sigma/\sqrt{n}$

Rearrange this to obtain

$\bar{X} - \sigma/\sqrt{n} < \mu < \bar{X} + \sigma/\sqrt{n}$

This tells us that for 68.3% of all samples the random interval between

$L = \bar{X} - \sigma/\sqrt{n}$ and $U = \bar{X} + \sigma/\sqrt{n}$

will contain the population mean $\mu$.

L is called the lower 68% confidence limit for $\mu$ and U is called the upper 68% confidence limit for $\mu$.

More usually 90% or 95% or 99% confidence limits are used.

E.g. from tables for N(0,1)

P(Z < 1.96) = 0.975

Hence P(-1.96 < Z < 1.96) = 0.95

This can be rearranged as before to obtain 95% confidence limits for $\mu$: $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$.

Similarly, 90% confidence limits for $\mu$ are $\bar{X} \pm 1.645\,\sigma/\sqrt{n}$ and 99% confidence limits for $\mu$ are $\bar{X} \pm 2.575\,\sigma/\sqrt{n}$.

The probability that $\mu$ lies in the random interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ is 0.95, where $\bar{X}$ is the random variable from which each sample mean is an observation.

If we observe a sample mean, $\bar{x}$, then the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ is the 95% C.I. for $\mu$. However, this is the observed interval (it is not random) and hence it either does or does not contain $\mu$. Thus, the statement that $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ contains $\mu$ with probability 0.95 is incorrect, because there is no random variable in this statement to which the probability can refer.

Instead, we say that the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ contains plausible values for $\mu$ that are consistent with the sample data, or that we are 95% confident that $\mu$ lies in the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$. "95% confident" does not refer to a probability for the observed interval, but to the notion of repeated sampling.
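The repeated-sampling meaning of "95% confident" can be illustrated by simulation. In the sketch below the population values (mean 165, SD 5, Normal shape) and the sample size 100 are assumptions made for the illustration, and `coverage` is a hypothetical helper name.

```python
import math
import random


def coverage(mu, sigma, n, repeats, z=1.96, seed=0):
    """Fraction of repeated samples whose interval xbar +/- z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(repeats):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        hits += (xbar - half_width) < mu < (xbar + half_width)
    return hits / repeats


rate = coverage(mu=165.0, sigma=5.0, n=100, repeats=2000)
print(f"share of intervals containing mu: {rate:.3f}")
```

Each individual interval either contains the population mean or does not; the 95% figure describes how often the procedure succeeds over many samples.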

In practice, $\sigma$ is usually unknown, but it can be estimated by the sample standard deviation, s. If this is done, the t distribution must be used instead of the standard normal.

Health survey example continued:

Sample of n = 100 women aged 25-29 years

sample mean $\bar{x}$ = 165 cm
sample standard deviation s = 5 cm

For now, we will pretend that the population SD is known to be exactly 5. A more accurate method must take account of the fact that the value 5 is only an estimate. This method is based on the t distribution instead of the normal distribution. The difference is small when the sample is large, as here.

95% confidence limits for the population mean, $\mu$, are

$165 + 1.96 \times 5/\sqrt{100} = 165.98 \approx 166$

$165 - 1.96 \times 5/\sqrt{100} = 164.02 \approx 164$

i.e. the 95% confidence interval for $\mu$ is (164, 166) cm.

Hence, plausible values for $\mu$ are 164-166 cm, or with 95% confidence the true study population mean height of women aged 25-29 years lies between 164 and 166 cm.
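The arithmetic above can be reproduced with the Python standard library. This is a minimal sketch that treats the population SD as known, as the example does; `z_interval` is a hypothetical helper name, and the multiplier comes from `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist


def z_interval(xbar, sigma, n, level=0.95):
    """Confidence limits xbar +/- z * sigma / sqrt(n) for a known population SD."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


low, high = z_interval(xbar=165.0, sigma=5.0, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f}) cm")
```

Passing `level=0.90` or `level=0.99` gives the other commonly used limits.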

5: Comparing Means of Two Continuous Variables

COMPARING MEANS OF TWO CONTINUOUS VARIABLES
Introduction

Example: From the MINITAB handbook, page 101

"A shoe company wanted to compare two materials (A and B) for use on the soles of boys' shoes. We could design an experiment to compare the two materials in two ways. One design might be to recruit ten boys (or more if our budget allowed) and give five of the boys shoes with material A and give five boys shoes with material B. Then after a suitable length of time, say three months, we could measure the wear on each boy's shoes. This would lead to independent samples. Now, you would expect a certain variability among ten boys; some boys wear out shoes much faster than others. A problem arises if this variability is large. It might completely hide an important difference between the two materials.

An alternative design, a paired design, attempts to remove some of this variability from the analysis so we can see more clearly any differences between the materials we are studying. Again, suppose we started with the same ten boys, but this time had each boy test both materials. There are several ways we could do this. Each boy could wear material A for three months, then material B for a second three months. Or we could give each boy a special pair of shoes with the sole on one shoe made from material A and the other from material B. This latter procedure produced the data in the table below:"

Boy MaterialA MaterialB


1 13.2 14.0
2 8.2 8.8
3 10.9 11.2
4 14.3 14.2
5 10.7 11.8
6 6.6 6.4
7 9.5 9.8
8 10.8 11.3
9 8.8 9.3
10 13.3 13.6

MTB > DotPlot 'materA' 'materB';
SUBC> Same.

[Dotplots of materA and materB on a common axis running from 6.0 to 13.5]

MTB > desc c1 c2

          N    MEAN  MEDIAN  TRMEAN  STDEV  SEMEAN
materA   10  10.630  10.750  10.675  2.451   0.775
materB   10  11.040  11.250  11.225  2.518   0.796

            MIN     MAX     Q1      Q3
materA    6.600  14.300  8.650  13.225
materB    6.400  14.200  9.175  13.700

The first design results in two independent samples, while the second design contains dependent samples. Different methods of analysis are appropriate in each case.

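Independent Samples
Suppose we have data from two samples, of sizes $n_1$ and $n_2$, which come from two populations which may or may not have the same means:

R.V. for sample data                 Population parameters

$X_1, X_2, \ldots, X_{n_1}$          $\mu_X$, $\sigma_X$

$Y_1, Y_2, \ldots, Y_{n_2}$          $\mu_Y$, $\sigma_Y$

To compare the population means $\mu_X$ and $\mu_Y$ you could find a confidence interval for $(\mu_X - \mu_Y)$ or test the null hypothesis

$H_0: \mu_X = \mu_Y$, i.e. $\mu_X - \mu_Y = 0$, against the alternative hypothesis

$H_1: \mu_X \neq \mu_Y$, i.e. $\mu_X - \mu_Y \neq 0$

To estimate $\mu_X - \mu_Y$ use the difference between the sample means $(\bar{X} - \bar{Y})$.

To obtain the sampling distribution of $(\bar{X} - \bar{Y})$ assume that X and Y are independent.

Using results on the sampling distribution of a mean we obtain

$\bar{X} \sim N(\mu_X, \sigma_X^2/n_1)$, $\bar{Y} \sim N(\mu_Y, \sigma_Y^2/n_2)$

Then

$E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_X - \mu_Y$

$\mathrm{var}(\bar{X} - \bar{Y}) = \mathrm{var}(\bar{X}) + \mathrm{var}(\bar{Y}) = \sigma_X^2/n_1 + \sigma_Y^2/n_2$

if X and Y are independent.

Also $(\bar{X} - \bar{Y})$ is approximately Normally distributed so that

$\bar{X} - \bar{Y} \sim N(\mu_X - \mu_Y,\ \sigma_X^2/n_1 + \sigma_Y^2/n_2)$

or $Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n_1 + \sigma_Y^2/n_2}} \sim N(0, 1)$

In practice $\sigma_X$ and $\sigma_Y$ are usually unknown so you would use the sample standard deviations $s_X$ and $s_Y$ to estimate $\sigma_X$ and $\sigma_Y$. There are two cases to consider.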

Case 1 (t test)

If we assume that $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ then a good estimate for $\sigma^2$ is the pooled sample variance

$s_{pooled}^2 = \frac{(n_1 - 1)s_X^2 + (n_2 - 1)s_Y^2}{n_1 + n_2 - 2}$

where $s_X^2$ and $s_Y^2$ were calculated using denominators $(n_1 - 1)$ and $(n_2 - 1)$.

Then the following t statistic can be used to construct a 95% C.I. or p-values.

$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{s_{pooled}\sqrt{1/n_1 + 1/n_2}}$

with $(n_1 + n_2 - 2)$ degrees of freedom.

Case 2 (Welch's test)

If $\sigma_X^2$ and $\sigma_Y^2$ are unlikely to be equal you should use

$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{s_X^2/n_1 + s_Y^2/n_2}}$

with $\nu$ degrees of freedom where

$\nu = \frac{\left(s_X^2/n_1 + s_Y^2/n_2\right)^2}{\frac{(s_X^2/n_1)^2}{n_1 - 1} + \frac{(s_Y^2/n_2)^2}{n_2 - 1}}$

When doing calculations without a computer, it can be time consuming to calculate the degrees of freedom for Welch's t test. In this situation, the t distribution with degrees of freedom equal to the smaller of $n_1 - 1$ and $n_2 - 1$ can be used to find critical values.

As the sample sizes increase, probability values based on this assumption become more accurate.

NOTE: this method will not be used in this course. We will assume that Case 1 applies, i.e. that the population variances are equal, $\sigma_X^2 = \sigma_Y^2$, and use a pooled estimate of variance.

Example: Boys' shoes

For the boys' shoes

$\bar{x} = 10.63$, $s_X = 2.451$, $n_1 = 10$

$\bar{y} = 11.04$, $s_Y = 2.518$, $n_2 = 10$

If we assume that the two samples of shoe wear measurements were independent, then the following analysis could be done. In fact, the data in the two samples are not independent, because the same boys were used each time, so the correct analysis would be to use a paired samples method.

Test $H_0: \mu_X = \mu_Y$ versus $H_1: \mu_X \neq \mu_Y$

Assume $\sigma_X = \sigma_Y$ and assume independent samples and so use

$s_{pooled}^2 = \frac{9(2.451)^2 + 9(2.518)^2}{18} = 6.174$, so $s_{pooled} = 2.485$

Note $s_{pooled}$ lies between $s_X$ and $s_Y$ as required.

To calculate the test statistic, assume $H_0$ is true. If $H_0$ is true then $\mu_X - \mu_Y = 0$ so the observed value of the t statistic is

$t = \frac{(10.63 - 11.04) - 0}{2.485\sqrt{1/10 + 1/10}} = -0.37$

This t statistic has 10 + 10 - 2 = 18 degrees of freedom.

The p-value is the probability of observing the value -0.37, or an even more extreme value, in either direction

$p = \Pr(t_{18} < -0.37 \text{ or } t_{18} > 0.37)$

$= 2P(t_{18} > 0.37)$

$> 2P(t_{18} > 0.688)$ (0.688 is the nearest tabulated value)

$= 2 \times 0.25 = 0.5$

so p > 0.5.

As the p-value is greater than 0.05 (i.e. it is not small) do not reject $H_0$. The difference between the sample means is not statistically significant.

Conclusion: The data are consistent with $H_0: \mu_X = \mu_Y$

i.e. there is no evidence that the average wear is different for materials A and B.

Remarks
The MINITAB command is "TWOSAMPLE". (Welch's test is used if the subcommand POOLED is not included.)

MTB > twosample c1 c2;
SUBC> pooled.

TWOSAMPLE T FOR materA VS materB

          N    MEAN  STDEV  SE MEAN
materA   10   10.63   2.45     0.78
materB   10   11.04   2.52     0.80

95 PCT CI FOR MU materA - MU materB: (-2.75, 1.93)

TTEST MU materA = MU materB (VS NE): T = -0.37  P = 0.72  DF = 18

POOLED STDEV = 2.49

The 95% C.I. for $\mu_X - \mu_Y$ can be constructed in the usual way:

Find the value c such that $\Pr(-c < t_{18} < c) = 0.95$.

Table 2 gives a value of c = 2.101 for t with 18 d.f.

95% CI $= (\bar{x} - \bar{y}) \pm 2.101\, s_{pooled}\sqrt{1/n_1 + 1/n_2}$

$= (10.63 - 11.04) \pm 2.101 \times 2.485 \times \sqrt{1/10 + 1/10}$

$= -0.41 \pm 2.33$
$= (-2.74, 1.92)$

The value 0 is contained in this interval and hence the data are consistent with $\mu_X - \mu_Y = 0$ and so provide no evidence for a difference in the population means.
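The pooled calculation can be reproduced from the raw shoe-wear data in the table. This is a sketch of the independent-samples (Case 1) analysis only, which the text itself notes is not the correct analysis for these paired data; the critical value 2.101 is taken from the text, and `pooled_t` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev

mater_a = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
mater_b = [14.0, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6]


def pooled_t(x, y, t_crit):
    """Pooled two-sample t statistic (H0: equal means), CI for the difference, pooled SD."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * stdev(x) ** 2 + (n2 - 1) * stdev(y) ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean(x) - mean(y)
    return diff / se, (diff - t_crit * se, diff + t_crit * se), math.sqrt(pooled_var)


# 2.101 is the tabulated t value for 18 d.f. quoted in the text.
t_stat, (ci_low, ci_high), s_pooled = pooled_t(mater_a, mater_b, t_crit=2.101)
print(f"t = {t_stat:.2f}, pooled SD = {s_pooled:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Small differences in the last decimal place from the hand calculation come from rounding of the intermediate values.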

6: Inference For Count data

INFERENCE FOR COUNT DATA
Sampling Distribution of a Sample Proportion
Suppose you are interested in a population in which every individual can be classified into one of two categories, e.g.

manufactured items: defective, OK
experimental animals: dead, alive
exam results: pass, fail

In general, we call these "success" and "failure". We want to arrive at conclusions about p, the proportion of successes in the population, using information from a sample.

Suppose you take a random sample of n individuals. Let the random variable X be the number of successes in the sample. A reasonable model would be to assume that X has the binomial distribution, X ~ binomial(n, p), where usually n, the sample size, is known but p is not.

The proportion of successes in the sample, X/n, provides an estimate for p (the population proportion). Write this as $\hat{p} = X/n$ (where $\hat{p}$ denotes "an estimate of p").

If X ~ binomial(n, p) then E(X) = np and var(X) = npq where q = 1 - p.

Since $\hat{p}$ is a function of X, and n is known, we have

$E(\hat{p}) = \frac{1}{n}E(X) = p$ and $\mathrm{var}(\hat{p}) = \frac{1}{n^2}\mathrm{var}(X) = \frac{pq}{n}$

so the standard error of $\hat{p}$ is

$SE(\hat{p}) = \sqrt{pq/n}$

SE($\hat{p}$) describes the variability of estimates $\hat{p}$ derived from different samples from the same population.

If p and q are estimated by the sample values $\hat{p}$ and $\hat{q} = 1 - \hat{p}$, then approximately

$SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$

The binomial distribution of X can be approximated by the Normal distribution N(np, npq), provided n > 20 and np > 5.

Hence the sampling distribution of $\hat{p}$ is approximately Normal: $\hat{p} \sim N(p, pq/n)$,

and $Z = \frac{\hat{p} - p}{\sqrt{pq/n}} \sim N(0, 1)$. This gives us a basis for making inferences about p: constructing confidence intervals and performing hypothesis tests.

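Confidence Intervals and tests for Proportions
To find a 95% CI for p, we first need to find c such that Pr(-c < Z < c) = 0.95. Using Table 1, c = 1.96. This can be written as:

P(-1.96 < Z < 1.96) = 0.95 where

$Z = \frac{\hat{p} - p}{SE(\hat{p})}$

so $\Pr\left(-1.96 < \frac{\hat{p} - p}{SE(\hat{p})} < 1.96\right) = 0.95$.

This can be rearranged to obtain the 95% confidence limits for p. A 95% CI for p, the true population proportion, is $\hat{p} \pm 1.96\,SE(\hat{p})$

where $\hat{p}$ is the observed sample proportion X/n.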

Example: Confidence interval

To predict the outcome of a referendum an opinion poll is conducted. A random sample of 552 people on the electoral roll are asked "will you vote for ....?" Their answers are recorded as either "yes" or "other" ("other" includes "no", "don't know", etc).

If 239 people say "yes", find a 95% confidence interval for the true population proportion of "yes" votes.

The proportion of "yes" answers in the sample is

$\hat{p} = 239/552 = 0.433$

so $\hat{q} = 1 - 0.433 = 0.567$ and the estimated standard error is

$SE(\hat{p}) = \sqrt{\frac{0.433 \times 0.567}{552}} = 0.021$

So a 95% confidence interval for the population proportion p is given by

$\hat{p} \pm 1.96\,SE(\hat{p}) = 0.433 \pm 1.96 \times 0.021$, i.e. (0.39, 0.47).

Thus with 95% confidence, the true proportion of "yes" votes lies between 39% and 47%.
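The referendum calculation can be written as a small function. This is a minimal sketch of the Normal-approximation interval used in the example; `proportion_ci` is a hypothetical helper name.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Sample proportion, its estimated standard error, and a Normal-approximation CI."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)


p_hat, se, (low, high) = proportion_ci(239, 552)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```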

Example: Sample size

In most Australian national elections, about 50% of votes are cast for the Liberal/National Coalition and 50% for the Australian Labor Party. How large a sample do you need to estimate the vote to within 3% with 95% confidence?

Let p be the proportion of Liberal/National votes.

The interval $\hat{p} \pm 1.96\,SE(\hat{p})$ will contain p with 95% confidence.

If we take the term $1.96\,SE(\hat{p})$ as the 3% (or 0.03) error, we need

$1.96\,SE(\hat{p}) = 0.03$

so $SE(\hat{p}) = 0.03/1.96 = 0.0153$

Using the estimate that approximately 50% of votes are for the Liberal/National Coalition, then $\hat{p} = 0.5$ and so $\hat{q} = 0.5$.

$SE(\hat{p}) = \sqrt{\frac{0.5 \times 0.5}{n}} = \frac{0.5}{\sqrt{n}}$

Hence, $\frac{0.5}{\sqrt{n}} = 0.0153$

and $\sqrt{n} = 32.67$,

which gives $n = (32.67)^2 = 1067$. This is the reason why most opinion polls use samples of about 1000 subjects.
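The same sample-size reasoning can be expressed as a formula in code. This is a sketch in which `sample_size_for_margin` is a hypothetical helper name; it rounds up to the next whole person, so it returns 1068 where the hand calculation above quotes 1067.

```python
import math


def sample_size_for_margin(margin, p=0.5, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) is no larger than `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


n_needed = sample_size_for_margin(0.03)
print(n_needed)
```

Because p(1 - p) is largest at p = 0.5, using 0.5 gives the most conservative sample size for a given margin.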

Example: Hypothesis Test

A company claims to have 40% of the market for some product. You conduct a survey and find 38 out of 112 buyers (i.e. 34%) purchased this brand. Are these data consistent with the company's claim or is your survey result of 34% significantly different from the company's claim of 40%?

Let X be the number of buyers who choose this brand.

Assume X ~ Binomial(n, p) with n = 112, since each buyer either purchases this brand or not (binomial).

Hypothesis: p = 0.4 (i.e. the true population proportion is 40%).

If this is true then X ~ Binomial(112, 0.4) and so E(X) = np = 112 × 0.4 = 44.8

Now to calculate the p-value for the test, we must calculate the probability of obtaining the observed value of 38, or an even more extreme value, in either direction from the expected value of 44.8.

Values of X that are either less than or equal to 38, or greater than or equal to 51.6 (= 44.8 + 6.8), are as extreme or more extreme than the observed value, so

p-value $= P(X \le 38 \text{ or } X \ge 51.6) = P(X=0) + P(X=1) + \cdots + P(X=38) + P(X=52) + \cdots + P(X=112)$

This could be calculated using the exact binomial formula to obtain the result 0.2102.

Alternatively, since n is large, we might use the Normal approximation to the Binomial distribution

X approx ~ N(np, npq)

np = 112 × 0.4 = 44.8

npq = 112 × 0.4 × 0.6 = 26.88

i.e. X ~ approx N(44.8, 26.88)

p-value $= P(X \le 38 \text{ or } X \ge 51.6)$ or, using the continuity correction,

$= P(X \le 38.5 \text{ or } X \ge 51.5)$
$= P\left(Z \le \frac{38.5 - 44.8}{\sqrt{26.88}} \text{ or } Z \ge \frac{51.5 - 44.8}{\sqrt{26.88}}\right)$
$= P(Z \le -1.215 \text{ or } Z \ge 1.292)$
$= .1122 + (1 - .9015)$
$= .210$

This p-value is greater than 0.05, so if the conventional significance level of 0.05 is chosen the result is "not statistically significant". The p-value is not small so we do not reject $H_0$ but instead conclude that the data are consistent with the claim that the true population proportion of buyers purchasing this brand is 40%. The sample proportion differed from this, but not by a statistically significant amount.
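Both routes to the p-value can be sketched with the standard library. The helper names `binom_pmf`, `exact_two_sided_p` and `normal_two_sided_p` are hypothetical; the "as or more extreme" rule follows the text (values at least as far from the expected count as the observed one).

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_two_sided_p(observed, n, p):
    """Sum P(X = k) over all k at least as far from n*p as the observed count."""
    expected = n * p
    distance = abs(observed - expected)
    return sum(binom_pmf(k, n, p) for k in range(n + 1)
               if abs(k - expected) >= distance - 1e-9)


def normal_two_sided_p(lower, upper, n, p):
    """Normal approximation to P(X <= lower or X >= upper), already continuity corrected."""
    dist = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
    return dist.cdf(lower) + (1 - dist.cdf(upper))


p_exact = exact_two_sided_p(38, 112, 0.4)
p_normal = normal_two_sided_p(38.5, 51.5, 112, 0.4)
print(f"exact p-value: {p_exact:.4f}, Normal approximation: {p_normal:.4f}")
```

The two values should be close to each other and to the figures quoted above, which is why the Normal approximation is acceptable for a sample of this size.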

Confidence Interval for the Difference Between Two Proportions

Example: City/country market survey

In a market survey it is found that 51 of 198 (26%) of people in cities used brand 'A' of a product and 26 of 145 (18%) of country people use it. Can we conclude that brand 'A' really is more popular in cities or could this difference have occurred by chance, i.e. due to sampling variability?

In general we may wish to compare the population proportions based on data from samples

                          Group 1                        Group 2

population proportions    $p_1$                          $p_2$

sample proportions        $\hat{p}_1 = X_1/n_1$          $\hat{p}_2 = X_2/n_2$

standard errors           $\sqrt{p_1 q_1/n_1}$           $\sqrt{p_2 q_2/n_2}$

We compare $p_1$ and $p_2$ by making inferences about their difference $(p_1 - p_2)$. This is estimated by $\hat{p}_1 - \hat{p}_2$.

To obtain the sampling distribution for $\hat{p}_1 - \hat{p}_2$ we use the formulas for expected values and variances of functions of random variables,

so $E(\hat{p}_1 - \hat{p}_2) = \frac{1}{n_1}E(X_1) - \frac{1}{n_2}E(X_2)$

$= \frac{n_1 p_1}{n_1} - \frac{n_2 p_2}{n_2} = p_1 - p_2$

and, for independent samples, $\mathrm{var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$.

If $p_1$, $q_1$, $p_2$ and $q_2$ are replaced by sample estimates then

$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$

Also, provided both samples are large (say $n_1 > 20$ and $n_2 > 20$) and neither $p_1$ nor $p_2$ is very near 0 or 1, then the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately Normal.

So

$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}\right)$ approximately.

Hence a 95% confidence interval for $(p_1 - p_2)$ is given by

$(\hat{p}_1 - \hat{p}_2) \pm 1.96\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$

For the data in the example above

$\hat{p}_1 = 51/198 \approx 0.2576$, $\hat{p}_2 = 26/145 \approx 0.1793$

so a 95% confidence interval is

$(0.2576 - 0.1793) \pm 1.96\sqrt{\frac{0.2576 \times 0.7424}{198} + \frac{0.1793 \times 0.8207}{145}}$

i.e. (-0.01, 0.17)

Thus the data suggest that, with 95% confidence, the true population difference in proportions who favour brand 'A' between cities and country areas is between -1% and 17%.

In particular, since this interval includes zero, the data are consistent with there being no difference in preferences (they are also consistent with differences of 10% or 17% or 1%, etc). The null hypothesis of "no difference" cannot be rejected and we cannot conclude that there is any real difference in preferences.
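The city/country interval can be checked numerically. This is a minimal sketch of the large-sample interval for a difference of two proportions; `two_proportion_ci` is a hypothetical helper name.

```python
import math


def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Normal-approximation CI for p1 - p2 based on independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)


diff, (low, high) = two_proportion_ci(51, 198, 26, 145)
print(f"difference = {diff:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The lower limit is slightly below zero, which is why the interval is said to include zero.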

7: More On Correlation And Regression

MORE ON CORRELATION AND REGRESSION
Use of Correlation and Regression
Data for one group or sample consist of two continuous measurements on each subject (or object)

Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Plot y against x.

Correlation and Correlation coefficient
Correlation measures the extent of linear association between x and y. The "correlation coefficient" is given by:

$r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy} = \frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})$, $s_x^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$ and $s_y^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$

In calculating a correlation, x and y are treated interchangeably. Both measurements are assumed to be subject to random variation.

Example: 20 female students in STAT101

The data below are the heights (cm) and weights (kg) of 20 female students taking STAT101

ROW   fht   fwt

  1   167    60
  2   164    65
  3   170    64
  4   163    47
  5   152    46
  6   160    57
  7   170    57
  8   160    55
  9   157    55
 10   170    65
 11   150    50
 12   156    46
 13   168    60
 14   159    55
 15   160    50
 16   172    69
 17   175    56
 18   169    56
 19   169    72
 20   156    56

gplot c2 c1

correl c2 c1

Correlation of fwt and fht = 0.673
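The correlation reported by MINITAB can be recomputed directly from the 20 height and weight pairs. This is a sketch with a hypothetical helper `correlation`; the common divisor n - 1 in the sample covariance and standard deviations cancels, so sums of squares and products are used directly.

```python
import math

heights = [167, 164, 170, 163, 152, 160, 170, 160, 157, 170,
           150, 156, 168, 159, 160, 172, 175, 169, 169, 156]
weights = [60, 65, 64, 47, 46, 57, 57, 55, 55, 65,
           50, 46, 60, 55, 50, 69, 56, 56, 72, 56]


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = correlation(heights, weights)
print(f"r = {r:.3f}")
```

Swapping the two arguments gives the same value, which reflects the statement that x and y are treated interchangeably.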

For regression, on the other hand, one variable y is regarded as an outcome of, or response to, the other variable x. The two variables are not interchangeable.

We find the equation for a straight line "y = a + b.x" which summarises the relationship between them. Here y is the outcome, response or dependent variable (output), while x is the predictor, explanatory or independent variable (input).

Often values of x are chosen by the investigator and are not random, whereas y values are measurements which are treated as random variables taken from distributions that depend on x.

Example: Health expenditure growth in Australia 1994-5

The following table shows the rate of growth (%) of health expenditure per person in Australia at constant 1984-5 prices.

(Source: Aust Inst. of Health, "Health Expenditure", Info. Bull. No 6, May 1991)

Year     1983-4   84-5   85-6   86-7   87-8   88-9

Growth      4.6    2.7    4.1    2.3    1.8    2.0

Denote the years by x = 1 for 1983-4, x = 2 for 1984-5 and so on.

MTB>plotc4c3


growth*


4.0+*




3.0+
*


*
2.0+*
*

++++++year
1.02.03.04.05.06.0


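We fit a line $\hat{y} = a + bx$ using the method of least squares to calculate $a$ and $b$ so that the sum of squares of the vertical distances, $\sum (y_i - \hat{y}_i)^2$, is minimised.

These values are

$$b = \frac{s_{xy}}{s_x^2}, \qquad a = \bar{y} - b\,\bar{x}$$

($s_{xy}$ and $s_x$ are defined above)

You can calculate $a$ and $b$ and then predict the y value for any given x:

$$\hat{y} = a + bx$$

$\hat{y}$ is called the fitted value.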
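As a worked illustration, the least squares calculation can be sketched in plain Python for the health expenditure data. This assumes the standard formulas (slope = s_xy / s_x^2, intercept = mean of y minus slope times mean of x); the function name `least_squares` is an illustrative choice.

```python
# Health expenditure data: x = 1 for 1983-4, ..., x = 6 for 1988-9.
year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]


def least_squares(x, y):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance of x and y, and sample variance of x (divisor n - 1).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b = s_xy / s_x2          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = least_squares(year, growth)
fitted = [a + b * x for x in year]   # fitted values y_hat

print(f"growth = {a:.2f} + ({b:.3f}) year")
print("fitted values:", [round(f, 3) for f in fitted])
```

The slope comes out negative, which corresponds to growth declining over the period.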

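Analysis of Variance
Analysis of variance is a technique for measuring the importance of a regression, and for estimating how much unexplained variation in Y is left after allowing for X.

The total (overall) variation among measured y values, ignoring the x's, is $\sum (y_i - \bar{y})^2$. The variation of y values from the line is less, and is given by $\sum (y_i - \hat{y}_i)^2$. The regression line is precisely the line that makes this second quantity as small as possible. It can be shown that

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

Total Variation = Explained Variation + Unexplained Variation

Sometimes (e.g. by MINITAB) these values are shown in an Analysis of Variance (ANOVA) table: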


Source of variation    Sum of Squares (SS)
Regression             $\sum (\hat{y}_i - \bar{y})^2$    <- "explained variation"
Error                  $\sum (y_i - \hat{y}_i)^2$        <- "unexplained variation"
Total                  $\sum (y_i - \bar{y})^2$

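Coefficient of Determination

This is a measure of how well the line fits the data.

It can be shown that

$$\text{coefficient of determination} = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = r^2$$

which is the square of the correlation coefficient.

So $100r^2$ is the percentage of variation "explained" by the regression line.

If the points all lie exactly on the line, i.e. if $y_i = \hat{y}_i$, then the "unexplained" (or error) variation is zero so the coefficient of determination is one.

If $\hat{y}_i = \bar{y}$, so the slope of the line $b$ is zero (i.e. no regression effect), then the coefficient of determination is zero.

In general: $0 \le$ coefficient of determination $\le 1$.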
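The decomposition of the total variation and the coefficient of determination can be checked numerically. The sketch below uses the health expenditure data from the earlier example and plain Python; the variable names (`ss_total`, `ss_regression`, `ss_error`) are illustrative and simply mirror the rows of the ANOVA table.

```python
from math import sqrt

# Health expenditure data: x = 1 for 1983-4, ..., x = 6 for 1988-9.
year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]

n = len(year)
x_bar = sum(year) / n
y_bar = sum(growth) / n

# Corrected sums of squares and products.
sxx = sum((x - x_bar) ** 2 for x in year)
syy = sum((y - y_bar) ** 2 for y in growth)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, growth))

# Least squares line and fitted values.
b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * x for x in year]

ss_total = syy                                                     # total variation
ss_regression = sum((f - y_bar) ** 2 for f in fitted)              # explained
ss_error = sum((y - f) ** 2 for y, f in zip(growth, fitted))       # unexplained

r = sxy / sqrt(sxx * syy)                  # correlation coefficient
r_squared = ss_regression / ss_total       # coefficient of determination

print(f"SS regression = {ss_regression:.4f}")
print(f"SS error      = {ss_error:.4f}")
print(f"SS total      = {ss_total:.4f}")
print(f"R-sq          = {100 * r_squared:.1f}%")
```

In this example the explained and unexplained sums of squares add up to the total, and the ratio of explained to total variation agrees with the square of the correlation coefficient.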

Example on health expenditure continued

MTB > brief 3
(note: specifies amount of output; see Minitab HELP or handbook)

MTB > regress c4 1 c3
(note: this regresses the dependent (y) variable, growth, on "1" predictor (x) variable year)

The regression equation is
growth = 4.67 - 0.500 year
(note: the relationship between growth and year. On average, growth declined by 1/2% per year)

Predictor       Coef     Stdev   t-ratio       p
Constant      4.6667    0.7171      6.51   0.003
year         -0.5000    0.1841     -2.72   0.053

s = 0.7703   R-sq = 64.8%   R-sq(adj) = 56.0%
(note: s is the estimated standard deviation of the data about the regression line. 64.8% of variation in growth is explained by the fitted equation)

Analysis of Variance

SOURCE        DF       SS       MS      F       p
Regression     1   4.3750   4.3750   7.37   0.053
Error          4   2.3733   0.5933
Total          5   6.7483

(note: Fit is the value predicted by the regression equation)

Obs.   year   growth     Fit   Stdev.Fit*   Residual   St.Resid*
  1    1.00    4.600   4.167        0.557      0.433        0.82
  2    2.00    2.700   3.667        0.419     -0.967       -1.49
  3    3.00    4.100   3.167        0.328      0.933        1.34
  4    4.00    2.300   2.667        0.328     -0.367       -0.53
  5    5.00    1.800   2.167        0.419     -0.367       -0.57
  6    6.00    2.000   1.667        0.557      0.333        0.63

* Not treated in this course.
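Several entries in the Minitab output can be reproduced by hand. The sketch below assumes the standard simple linear regression formulas for the residual standard deviation, the standard errors of the coefficients, the t-ratios, the F ratio and the adjusted R-squared; these formulas are not stated in the text above, so treat them as assumptions. The p-values are omitted because they need t and F distribution tables (or a statistics library).

```python
from math import sqrt

# Health expenditure data: x = 1 for 1983-4, ..., x = 6 for 1988-9.
year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]

n = len(year)
x_bar = sum(year) / n
y_bar = sum(growth) / n
sxx = sum((x - x_bar) ** 2 for x in year)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, growth))

b = sxy / sxx            # slope ("year" coefficient)
a = y_bar - b * x_bar    # intercept ("Constant")

residuals = [y - (a + b * x) for x, y in zip(year, growth)]
ss_error = sum(e ** 2 for e in residuals)
ss_total = sum((y - y_bar) ** 2 for y in growth)
ss_regression = ss_total - ss_error

ms_error = ss_error / (n - 2)     # error mean square, n - 2 degrees of freedom
s = sqrt(ms_error)                # estimated sd about the regression line

se_b = s / sqrt(sxx)                           # Stdev of the slope
se_a = s * sqrt(1 / n + x_bar ** 2 / sxx)      # Stdev of the constant
t_b = b / se_b
t_a = a / se_a
f_ratio = (ss_regression / 1) / ms_error       # regression MS / error MS
r_sq_adj = 1 - ms_error / (ss_total / (n - 1))

print(f"Constant {a:.4f}  Stdev {se_a:.4f}  t-ratio {t_a:.2f}")
print(f"year     {b:.4f}  Stdev {se_b:.4f}  t-ratio {t_b:.2f}")
print(f"s = {s:.4f}  R-sq(adj) = {100 * r_sq_adj:.1f}%  F = {f_ratio:.2f}")
```

With a single predictor, the F ratio is the square of the t-ratio for the slope, which is why both carry the same p-value in the output.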

STATISTICAL INFERENCE
8: Summary

SUMMARY
How To Choose The Appropriate Method For Statistical Inference
1. Are the data continuous or categorical?
2. How many sets of measurements?

one: 1 variable measured on 1 sample
two: 1 variable measured on 2 (independent or dependent) samples
two: 2 variables measured on 1 sample

Sets of measurements

Scale         1 sample               2 samples              1 sample
              1 variable             1 variable             2 variables

continuous    z or t                 z or t                 correlation or regression
categorical   binomial or $\chi^2$   binomial or $\chi^2$   $\chi^2$
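The table above can be read as a simple lookup from the answers to the two questions to a family of methods. The following sketch encodes it as a Python dictionary; the names `METHODS` and `choose_method` and the exact key strings are hypothetical labels chosen for illustration.

```python
# Lookup table: (scale of measurement, sets of measurements) -> method family.
# Keys and names are illustrative; the entries follow the summary table.
METHODS = {
    ("continuous", "1 sample, 1 variable"): "z or t",
    ("continuous", "2 samples, 1 variable"): "z or t",
    ("continuous", "1 sample, 2 variables"): "correlation or regression",
    ("categorical", "1 sample, 1 variable"): "binomial or chi-square",
    ("categorical", "2 samples, 1 variable"): "binomial or chi-square",
    ("categorical", "1 sample, 2 variables"): "chi-square",
}


def choose_method(scale, design):
    """Return the method family for a given scale and design."""
    try:
        return METHODS[(scale, design)]
    except KeyError:
        raise ValueError(f"no entry for scale={scale!r}, design={design!r}")


# Example: heights and weights measured on the same 20 students.
print(choose_method("continuous", "1 sample, 2 variables"))
```

The lookup only narrows the choice to a family of methods; deciding between z and t, or between correlation and regression, still depends on the further questions asked for each case.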

1 sample/group, 1 variable, continuous scale

Model: assume $X_1, \dots, X_n \sim N(\mu, \sigma^2)$

To make inferences about $\mu$

use $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ if $\sigma$ is known

OR

use $\dfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$ if $\sigma$ is estimated by $s$.
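To make the choice between the two statistics concrete, here is a minimal sketch that computes each one for a hypothetical mean mu0. It assumes the usual standardised forms (sample mean minus mu0, divided by the standard error). The sample values, mu0 and sigma are made up for illustration and do not come from the course data.

```python
from math import sqrt
from statistics import mean, stdev


def z_statistic(sample, mu0, sigma):
    """Z = (x_bar - mu0) / (sigma / sqrt(n)), for use when sigma is known."""
    n = len(sample)
    return (mean(sample) - mu0) / (sigma / sqrt(n))


def t_statistic(sample, mu0):
    """Return (t, df) with t = (x_bar - mu0) / (s / sqrt(n)) and df = n - 1."""
    n = len(sample)
    s = stdev(sample)   # sample standard deviation, divisor n - 1
    return (mean(sample) - mu0) / (s / sqrt(n)), n - 1


# Hypothetical weights (kg); not taken from the course data.
sample = [58.0, 61.0, 64.0, 57.0, 60.0]

z = z_statistic(sample, mu0=58.0, sigma=3.0)   # sigma assumed known
t, df = t_statistic(sample, mu0=58.0)          # sigma estimated by s

print(f"z = {z:.3f}")
print(f"t = {t:.3f} on {df} degrees of freedom")
```

The z value is referred to the standard normal distribution and the t value to the t distribution with n - 1 degrees of freedom.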

2 samples/groups, 1 variable, continuous scale


1 sample/group, 2 variables, both continuous
(i) Are both variables random (e.g. measurements or responses) or were the values of one variable chosen by the investigator?

(ii) Association or prediction?


1 sample, 1 variable, categorical


2 samples/groups, 1 variable, categorical


1 sample/group, 2 variables, categorical

Data are frequencies in a contingency table; use the $\chi^2$ test.
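As an illustration of the contingency table case, the sketch below computes the Pearson chi-square statistic, assuming the usual form: the sum over cells of (observed - expected)^2 / expected, with expected counts given by row total times column total divided by the grand total. The function name `chi_square` and the 2 x 2 counts are hypothetical.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of counts.

    `table` is a list of rows, each row a list of observed frequencies.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 table of frequencies (made up for illustration).
counts = [[20, 30],
          [30, 20]]

stat, df = chi_square(counts)
print(f"chi-square = {stat:.2f} on {df} degree(s) of freedom")
```

The statistic is then compared with the chi-square distribution on (rows - 1) x (columns - 1) degrees of freedom.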
