Confidence Intervals

Part I
Does the mean vitamin C blood level of smokers differ from that of nonsmokers? Let's suppose for a
moment they do, with smokers tending to have lower levels. Nevertheless, we wouldn't expect every
smoker to have levels lower than those of every nonsmoker. There would be some overlap in the two
distributions. This is one reason why questions like this are usually answered in terms of population
means, namely, how the mean level of all smokers compares to that of all nonsmokers.
The statistical tool used to answer such questions is the confidence interval (CI) for the difference
between the two population means. But let's forget the formal study of statistics for the moment. What
might you do to answer the question if you were on your own? You might get a random sample of
smokers and nonsmokers, measure their vitamin C levels, and see how they compare. Suppose we've
done it. In a sample of 40 Boston male smokers, vitamin C levels had a mean of 0.60 mg/dl and an SD of
0.32 mg/dl, while in a sample of 40 Boston male nonsmokers, the levels had a mean of 0.90 mg/dl and
an SD of 0.35 mg/dl. (Strictly speaking, we can only talk about Boston area males rather than all
smokers and nonsmokers. No one ever said research was easy.) The difference in means between
nonsmokers and smokers is 0.30 mg/dl!
The difference of 0.30 looks impressive compared to means of 0.60 and 0.90, but we know that if we
were to take another random sample, the difference wouldn't be exactly the same. It might be greater,
it might be less. What kind of population difference is consistent with this observed value of 0.30 mg/dl?
How much larger or smaller might the difference in population means be if we could measure all
smokers and nonsmokers? In particular, is 0.30 mg/dl the sort of sample difference that might be
observed if there were no difference in the population mean vitamin C levels? We estimate the
difference in mean vitamin C levels at 0.30 mg/dl, but 0.30 mg/dl "give or take what"? This is where
statistical theory comes in.
One way to answer these questions is by reporting a 95% confidence interval. A 95% confidence interval
is an interval generated by a process that's right 95% of the time. Similarly, a 90% confidence interval is
an interval generated by a process that's right 90% of the time and a 99% confidence interval is an
interval generated by a process that's right 99% of the time. If we were to replicate our study many
times, each time reporting a 95% confidence interval, then 95% of the intervals would contain the
population mean difference. In practice, we perform our study only once. We have no way of knowing
whether our particular interval is correct, but we behave as though it is. Here, the 95% confidence
interval for the difference in mean vitamin C levels between nonsmokers and smokers is 0.15 to 0.45
mg/dl. Thus, not only do we estimate the difference to be 0.30 mg/dl, but we are 95% confident it is no
less than 0.15 mg/dl and no greater than 0.45 mg/dl.
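The idea of "a process that's right 95% of the time" can be checked by simulation. The sketch below is only an illustration under stated assumptions: it pretends the two populations are normal with the sample means and SDs quoted above as their true values, and it uses the usual large-sample interval (the sample difference plus or minus 1.96 standard errors). The function name `coverage` and its parameters are made up for this example.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage(true_diff=0.30, n=40, reps=2000, z=1.96, seed=1):
    """Fraction of simulated studies whose interval contains true_diff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Assumed populations: normal, with the quoted sample values taken as the truth.
        smokers = [rng.gauss(0.60, 0.32) for _ in range(n)]
        nonsmokers = [rng.gauss(0.60 + true_diff, 0.35) for _ in range(n)]
        diff = mean(nonsmokers) - mean(smokers)
        # Standard error of the difference between two independent sample means.
        se = sqrt(stdev(smokers) ** 2 / n + stdev(nonsmokers) ** 2 / n)
        if diff - z * se <= true_diff <= diff + z * se:
            hits += 1
    return hits / reps

print(f"Share of intervals containing the true difference: {coverage():.3f}")
```

The printed share should land close to 0.95. Any single simulated study, like any single real one, either contains the true difference or it doesn't.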
In theory, we can construct intervals of any level of confidence from 0 to 100%. There is a tradeoff
between the amount of confidence we have in an interval and its length. A 95% confidence interval for a
population mean difference is constructed by taking the sample mean difference and adding and
subtracting 1.96 standard errors of the mean difference. A 90% CI adds and subtracts 1.645 standard
errors of the mean difference, while a 99% CI adds and subtracts 2.576 standard errors of the mean
difference. For a given set of data, the shorter the confidence interval, the less likely it is to contain
the quantity being estimated. The longer the interval, the more likely it is to contain the quantity being
estimated. Ninety-five percent has been found to be a convenient level for conducting scientific
research, so it is used almost universally. Intervals of lesser confidence would lead to too many
misstatements. Greater confidence would require more data to generate intervals of usable lengths.
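As a rough check on the arithmetic, here is a minimal sketch of the large-sample interval applied to the vitamin C summary statistics. It assumes the standard error of the difference is the square root of the sum of the two squared standard errors (two independent samples); the function name `mean_diff_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_diff_ci(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Large-sample (normal approximation) CI for mean1 - mean2."""
    diff = mean1 - mean2
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    # Multiplier: about 1.645, 1.96 and 2.576 for 90%, 95% and 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se

# Nonsmokers minus smokers, 40 subjects in each sample.
for level in (0.90, 0.95, 0.99):
    lo, hi = mean_diff_ci(0.90, 0.35, 40, 0.60, 0.32, 40, level)
    print(f"{level:.0%} CI: {lo:.2f} to {hi:.2f} mg/dl")
```

The same data give a longer interval as the confidence level rises, which is the tradeoff described above.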

Confidence Intervals and P values

Part II
[Zero is a special value. If a difference between two means is 0, then the two means are equal!]
Confidence intervals contain population values found to be consistent with the data. If a confidence
interval for a mean difference includes 0, the data are consistent with a population mean difference of
0. If the difference is 0, the population means are equal. If the confidence interval for a difference
excludes 0, the data are not consistent with equal population means. Therefore, one of the first things
to look at is whether a confidence interval for a difference contains 0. If 0 is not in the interval, a
difference has been established. If a CI contains 0, then a difference has not been established. When we
start talking about significance tests, we'll refer to differences that exclude 0 as a possibility as
statistically significant. For the moment, we'll use the term sparingly.
A statistically significant difference may or may not be of practical importance. Statistical significance
and practical importance are separate concepts. Some authors confuse the issues by talking about
statistical significance and practical significance or by talking about, simply, significance. In these notes,
there will be no mixing and matching. It's either statistically significant or practically important; any
other combination should be consciously avoided.
Serum cholesterol values (mg/dl) in a free-living population tend to be between the mid 100s and the
high 200s. It is recommended that individuals have serum cholesterols of 200 or less. A change of 1 or 2
mg/dl is of no importance. Changes of 10-20 mg/dl and more can be expected to have a clinical impact
on the individual subject. Consider an investigation to compare mean serum cholesterol levels produced
by two diets by looking at confidence intervals for $\mu_1 - \mu_2$ based on $\bar{x}_1 - \bar{x}_2$.
High cholesterol levels are bad. If $\bar{x}_1 - \bar{x}_2$ is positive, the mean from diet 1 is greater
than the mean from diet 2, and diet 2 is favored. If $\bar{x}_1 - \bar{x}_2$ is negative, the mean from
diet 1 is less than the mean from diet 2, and diet 1 is favored. Here are six possible outcomes of the
experiment.

Case     $\bar{x}_1 - \bar{x}_2$ (mg/dl)   95% CI (mg/dl)
Case 1   2                                 (1, 3)
Case 2   30                                (20, 40)
Case 3   30                                (2, 58)
Case 4   1                                 (-1, 3)
Case 5   2                                 (-58, 62)
Case 6   30                                (-2, 62)

For each case, let's consider, first, whether a difference between population means has been
demonstrated and then what the clinical implications might be.
In cases 1-3, the data are judged inconsistent with a population mean difference of 0. In cases 4-6, the
data are consistent with a population mean difference of 0.
Case 1: There is a difference between the diets, but it is of no practical importance.

Case 2: The difference is of practical importance even though the confidence interval is 20
mg/dl wide.

Case 3: The difference may or may not be of practical importance. The interval is too wide to
say for sure. More study is needed.

Case 4: We cannot claim to have demonstrated a difference. We are confident that if there
is a real difference it is of no practical importance.

Cases 5 and 6: We cannot claim to have demonstrated a difference. The population mean
difference is not well enough determined to rule out all cases of practical
importance.
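The reasoning behind these verdicts can be written down as two separate questions asked of each interval: does it exclude 0, and do its values fall on one side of a threshold of practical importance? The sketch below is one way to encode that, not a rule from these notes. It assumes a cutoff of 10 mg/dl (taken from the remark that changes of 10-20 mg/dl matter while 1 or 2 mg/dl do not), and it takes the intervals for cases 4-6 to have negative lower limits, as they must if they contain 0. The function name `classify` is hypothetical.

```python
def classify(ci_low, ci_high, important=10.0):
    """Summarize a CI for a mean difference (mg/dl).

    `important` is an assumed cutoff for practical importance.
    Returns (difference_shown, importance).
    """
    difference_shown = not (ci_low <= 0 <= ci_high)
    # Smallest and largest magnitudes of difference inside the interval.
    smallest = min(abs(ci_low), abs(ci_high)) if difference_shown else 0.0
    largest = max(abs(ci_low), abs(ci_high))
    if smallest >= important:
        importance = "important"        # every value in the CI matters
    elif largest < important:
        importance = "not important"    # no value in the CI matters
    else:
        importance = "undetermined"     # the CI spans both possibilities
    return difference_shown, importance

cases = {1: (1, 3), 2: (20, 40), 3: (2, 58), 4: (-1, 3), 5: (-58, 62), 6: (-2, 62)}
for case, (low, high) in cases.items():
    shown, importance = classify(low, high)
    print(f"Case {case}: difference shown={shown}, practical importance={importance}")
```

Keeping the two answers separate mirrors the distinction between statistical significance and practical importance: neither answer can be read off from the other.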

Cases 5 and 6 require careful handling. While neither interval formally demonstrates a difference
between diets, Case 6 is certainly more suggestive of something than Case 5. Both cases are consistent
with differences of practical importance and differences of no importance at all. However, Case 6, unlike
Case 5, seems to rule out any advantage of practical importance for Diet 1, so it might be argued that
Case 6 is like Case 3 in that both are consistent with important and unimportant advantages to Diet 2
while neither suggests any advantage to Diet 1.
It is common to find reports stating that there was no difference between two treatments. As Douglas
Altman and Martin Bland emphasize, absence of evidence is not evidence of absence, that is, failure to
show a difference is not the same thing as showing two treatments are the same. Only Case 4 allows the
investigators to say there is no difference between the diets. The observed difference is not statistically
significant and, if it should turn out there really is a difference (no two population means are exactly
equal to an infinite number of decimal places), it would not be of any practical importance.
Many writers make the mistake of interpreting cases 5 and 6 to say there is no difference between the
treatments or that the treatments are the same. This is an error. It is not supported by the data. All we
can say in cases 5 and 6 is that we have been unable to demonstrate a difference between the diets. We
cannot say they are the same. The data say they may be the same, but they may be quite different.
Studies like this, which cannot distinguish between situations that have very different implications, are
said to be underpowered, that is, they lack the power to answer the question definitively one way or the
other.
In some situations, it's important to know if there is an effect no matter how small, but in most cases it's
hard to rationalize saying whether or not a confidence interval contains 0 without reporting the CI, and
saying something about the magnitude of the values it contains and their practical importance. If a CI
does not include 0, are all of the values in the interval of practical importance? If the CI includes 0, have
effects of practical importance been ruled out? If the CI includes 0 AND values of practical importance,
YOU HAVEN'T LEARNED ANYTHING!
Copyright 1999 Gerard E. Dallal
Last modified: 05/23/2012 02:51:33.
http://www.JerryDallal.com/LHSP/ci.htm

P values
To understand P values, you have to understand fixed level testing. With fixed level testing, a null
hypothesis is proposed (usually, specifying no treatment effect) along with a level for the test, usually
0.05. All possible outcomes of the experiment are listed in order to identify extreme outcomes that
would occur less than 5% of the time in aggregate if the null hypothesis were true. This set of values is
known as the critical region. They are critical because if any of them are observed, something extreme
has occurred. Data are now collected and if any one of those extreme outcomes occurs, the results are
said to be significant at the 0.05 level. The null hypothesis is rejected at the 0.05 level of significance and
one star (*) is printed somewhere in a table. Some investigators note extreme outcomes that would
occur less than 1% of the time and print two stars (**) if any of those are observed.
The procedure is known as fixed level testing because the level of the test is fixed prior to data
collection. In theory if not in practice, the procedure begins by specifying the hypothesis to be
tested and the test statistic to be used along with the set of outcomes that will cause the hypothesis to
be rejected. Only then are data collected to see whether they lead to rejection of the null hypothesis.
Many researchers quickly realized the limitations of reporting only whether a result achieved the 0.05
level of significance. Was a result just barely significant or wildly so? Would data that were significant at
the 0.05 level be significant at the 0.01 level? At the 0.001 level? Even if the results are wildly statistically
significant, is the effect large enough to be of any practical importance?
As computers became readily available, it became common practice to report the observed significance
level (or P value), the smallest fixed level at which the null hypothesis can be rejected. If your
personal fixed level is greater than or equal to the P value, you would reject the null hypothesis. If your
personal fixed level is less than the P value, you would fail to reject the null hypothesis. For example,
if a P value is 0.027, the results are significant for all fixed levels greater than 0.027 (such as 0.05) and
not significant for all fixed levels less than 0.027 (such as 0.01). A person who uses the 0.05 level would
reject the null hypothesis while a person who uses the 0.01 level would fail to reject it.
A P value is often described as the probability of seeing results as extreme as or more extreme than
those actually observed if the null hypothesis were true. While this description is correct, it invites the
question of why we should be concerned with the probability of events that have not occurred! (As
Harold Jeffreys quipped, "What the use of P implies, therefore, is that a hypothesis that may be true
may be rejected because it has not predicted observable results that have not occurred.") In fact, we
care because the P value is just another way to describe the results of fixed level tests.
Every so often, a call is made for a ban on significance tests. Papers and books are written, conferences
are held, and proceedings are published. The main reason behind these movements is that P values
tell us nothing about the magnitudes of the effects that might lead us to reject or fail to reject the null
hypothesis. Significance tests blur the distinction between statistical significance and practical
importance. It is possible for a difference of little practical importance to achieve a high degree of
statistical significance. It is also possible for clinically important differences to be missed because an
experiment lacks the power to detect them. However, significance tests provide a useful summary of
the data and these concerns are easily remedied by supplementing significance tests with the
appropriate confidence intervals for the effects of interest.
When hypotheses of equal population means are tested, determining whether P is less than 0.05 is just
another way of examining a confidence interval for the mean difference to see whether it excludes 0.
The hypothesis of equality will be rejected at level $\alpha$ if and only if a $100(1-\alpha)\%$
confidence interval for the mean difference fails to contain 0. For example, the hypothesis of equality of
population means will be rejected at the 0.05 level if and only if a 95% CI for the mean difference does
not contain 0. The hypothesis will be rejected at the 0.01 level if and only if a 99% CI does not contain 0,
and so on.
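The correspondence between tests and intervals is easy to verify numerically. The sketch below assumes the large-sample normal approximation for both the test and the interval, and uses made-up illustrative values (a difference of 30 with a standard error of 14); the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(diff, se):
    """Two-sided P value for a population mean difference of 0 (normal approximation)."""
    return 2 * (1 - STD_NORMAL.cdf(abs(diff / se)))

def ci(diff, se, alpha):
    """100(1 - alpha)% confidence interval for the population mean difference."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se

diff, se = 30.0, 14.0  # illustrative values only
for alpha in (0.05, 0.01):
    low, high = ci(diff, se, alpha)
    rejected = p_value(diff, se) <= alpha
    excludes_zero = not (low <= 0 <= high)
    print(f"alpha={alpha}: reject equality={rejected}, CI excludes 0={excludes_zero}")
```

For each level the two answers agree: with these values equality is rejected at the 0.05 level and the 95% CI excludes 0, while at the 0.01 level equality is not rejected and the 99% CI contains 0.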
This is a good time to revisit the cholesterol studies presented during the discussion of confidence
intervals. We assumed a treatment mean difference of a couple of units (mg/dl) was of no consequence,
but differences of 10 mg/dl and up had important public health and policy implications. The discussion
and interpretation of the 6 cases remains the same, except that we can add the phrase statistically
significant to describe the cases where the P values are less than 0.05.

Case     $\bar{x}_1 - \bar{x}_2$   se($\bar{x}_1 - \bar{x}_2$)   t test statistic   P value    P value < 0.05   95% CI      Practical importance
Case 1   2                         0.5                           4                  <0.0001    Yes              (1, 3)      No
Case 2   30                        5                             6                  <0.0001    Yes              (20, 40)    Yes
Case 3   30                        14                            2.1                0.032      Yes              (2, 58)     ?
Case 4   1                         1                             1                  0.317      No               (-1, 3)     No
Case 5   2                         30                            0.1                0.947      No               (-58, 62)   ?
Case 6   30                        16                            1.9                0.061      No               (-2, 62)    ?
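The test statistic and P value columns follow from the first two columns. As a hedged sketch, assuming the P values are two-sided and come from the large-sample normal approximation (an assumption that reproduces the tabulated values for the rows used here), the calculation for cases 3, 5, and 6 looks like this. Case 5 uses the observed difference of 2 mg/dl stated in the text; the function name `statistic_and_p` is hypothetical.

```python
from statistics import NormalDist

def statistic_and_p(diff, se):
    """Test statistic (difference / standard error) and two-sided P value."""
    t = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# (difference, standard error) in mg/dl
rows = {"Case 3": (30, 14), "Case 5": (2, 30), "Case 6": (30, 16)}
for case, (diff, se) in rows.items():
    t, p = statistic_and_p(diff, se)
    print(f"{case}: test statistic = {t:.1f}, P = {p:.3f}")
```

Cases 3 and 6 share the same observed difference and have similar standard errors, yet they fall on opposite sides of 0.05. That is a reminder of how little separates a "significant" result from a "nonsignificant" one.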

Significance tests can tell us whether a difference between sample means is statistically significant, that
is, whether the observed difference is larger than would be due to random variation if the underlying
population difference were 0. But significance tests do not tell us whether the difference is of practical
importance. Statistical significance and practical importance are distinct concepts.
In cases 1-3, the data are judged inconsistent with a population mean difference of 0. The P values are
less than 0.05 and the 95% confidence intervals do not contain 0. The sample mean difference is much
larger than can be explained by random variability about a population mean difference of 0. In cases
4-6, the data are consistent with a population mean difference of 0. The P values are greater than 0.05
and the 95% confidence intervals contain 0. The observed difference is consistent with random
variability about 0.
Case 1: There is a statistically significant difference between the diets, but the difference is of
no practical importance, being no greater than 3 mg/dl.

Case 2: The difference is statistically significant and is of practical importance even though the
confidence interval is 20 mg/dl wide. This case illustrates that a wide confidence
interval is not necessarily a bad thing, if all of the values point to the same conclusion.
Diet 2 is clearly superior to diet 1, even though the likely benefit can't be specified
to within a range of 20 mg/dl.

Case 3: The difference is statistically significant but it may or may not be of practical
importance. The confidence interval is too wide to say for sure. The difference may be
as little as 2 mg/dl, but could be as great as 58 mg/dl. More study may be needed.
However, knowledge of a difference between the diets, regardless of its magnitude,
may lead to research that exploits and enhances the beneficial effects of the more
healthful diet.

Case 4: The difference is not statistically significant and we are confident that if there is a real
difference it is of no practical importance.

Cases 5 and 6: The difference is not statistically significant, so we cannot claim to have demonstrated a
difference. However, the population mean difference is not well enough determined to
rule out all differences of practical importance.

Cases 5 and 6 require careful handling. Case 6, unlike Case 5, seems to rule out any advantage of
practical importance for Diet 1, so it might be argued that Case 6 is like Case 3 in that both of them are
consistent with important and unimportant advantages for Diet 2 while neither suggests any advantage
to Diet 1.
Many analysts accept illustrations such as these as a blanket indictment of significance tests. I prefer to
see them as a warning to continue beyond significance tests to see what other information is contained
in the data. In some situations, it's important to know if there is an effect no matter how slight, but in
most cases it's hard to justify publishing the results of a significance test without saying something
about the magnitude of the effect. If a result is statistically significant, is it of practical importance? If
the result is not statistically significant, have effects of practical importance been ruled out? If a result is
not statistically significant but has not ruled out effects of practical importance, YOU HAVEN'T LEARNED
ANYTHING!
Case 5 deserves another visit in order to underscore an important lesson that is usually not appreciated
the first time 'round: "Absence of evidence is not evidence of absence!" In case 5, the observed
difference is 2 mg/dl, the value 0 is nearly at the center of the confidence interval, and the P value for
testing the equality of means is 0.947. It is correct to say that the difference between the two diets did
not reach statistical significance or that no statistically significant difference was shown. Some
researchers refer to such findings as "negative", yet it would be incorrect to say that the diets are the
same. The absence of evidence for a difference is not the same thing as evidence of absence of an
effect. In BMJ, 290 (1985), 1002, Chalmers proposed outlawing the term "negative trial" for just this
reason.
When the investigator would like to conjecture about the absence of an effect, the most effective
procedure is to report confidence intervals so that readers have a feel for the sensitivity of the
experiment. In cases 4 and 5, the researchers are entitled to say that there was no significant finding.
Both have P values much larger than 0.05. However, only in case 4 is the researcher entitled to say that
the two diets are equivalent: the best available evidence is that they produce mean cholesterol values
within 3 mg/dl of each other, which is probably too small to worry about. One can only hope that a
claim of no difference based on data such as in case 5 would never see publication.
Should P values be eliminated from the research literature in favor of confidence intervals? This
discussion provides some support for this proposal, but there are many situations where the magnitude
of an effect is not as important as whether or not an effect is present. I have no objection to using P
values to focus on the presence or absence of an effect, provided the confidence intervals are available
for those who want them, statistical significance is not mistaken for practical importance, and absence
of evidence is not mistaken for evidence of absence.
As useful as confidence intervals are, they are not a cure-all. They offer estimates of the effects they
measure, but only in the context in which the data were collected. It would not be surprising to see
confidence intervals vary between studies much more than any one interval would suggest. This can be
the result of the technician, measurement technique, or the particular group of subjects being
measured, among other causes. This is one of the things that plagues meta-analysis, even in medicine
where the outcomes are supposedly well-defined. This is yet another reason why significance tests are
useful. There are many situations where the most useful piece of information that a confidence interval
provides is simply that there is an effect or treatment difference.

Copyright 1999 Gerard E. Dallal
Last modified: 05/23/2012 02:51:33.
http://www.JerryDallal.com/LHSP/pval.htm