Radiology
Statistical inference involves two analysis methods: estimation and hypothesis testing,
Index term: the latter of which is the subject of this article. Specifically, Z tests of proportion are
Statistical analysis highlighted and illustrated with imaging data from two previously published clinical
studies. First, to evaluate the relationship between nonenhanced computed tomo-
Published online before print
10.1148/radiol.2263011500 graphic (CT) findings and clinical outcome, the authors demonstrate the use of the
Radiology 2003; 226:609 – 613 one-sample Z test in a retrospective study performed with patients who had ureteral
calculi. Second, the authors use the two-sample Z test to differentiate between primary
Abbreviations: and metastatic ovarian neoplasms in the diagnosis and staging of ovarian cancer. These
H0 ⫽ null hypothesis
H1 ⫽ alternative hypothesis
data are based on a subset of cases from a multiinstitutional ovarian cancer trial
RDOG ⫽ Radiologic Diagnostic conducted by the Radiologic Diagnostic Oncology Group, in which the roles of CT,
Oncology Group magnetic resonance imaging, and ultrasonography (US) were evaluated. The statistical
formulas used for these analyses are explained and demonstrated. These methods may
1
From the Department of Radiology, enable systematic analysis of proportions and may be applied to many other radiologic
Brigham and Women’s Hospital, Har- investigations.
vard Medical School, Boston, Mass © RSNA, 2003
(K.H.Z., J.R.F., S.G.S., C.M.C.T.); and
Department of Health Care Policy,
Harvard Medical School, 180 Long-
wood Ave, Boston, MA 02115
(K.H.Z.). Received September 10, Statistics often involve a comparison of two values when one or both values are associated
2001; revision requested November 8; with some uncertainty. The purpose of statistical inference is to aid the clinician, re-
revision received December 12; ac-
cepted December 19. Supported in
searcher, or administrator in reaching a conclusion concerning a population by examining
part by Public Health Service Grant a sample from that population. Statistical inference consists of two components, estima-
NIH-U01 CA9398-03 awarded by the tion and hypothesis testing, and the latter component is the main focus of this article.
National Cancer Institute, Department Estimation can be carried out on the basis of sample values from a larger population (1).
of Health and Human Services. Ad-
Point estimation involves the use of summary statistics, including the sample mean and
dress correspondence to K.H.Z. (e-
mail: zou@bwh.harvard.edu). SD. These values can be used to estimate intervals, such as the 95% confidence level. For
Current address: example, by using summary statistics, one can determine the sensitivity or specificity of
2
Department of Radiology, University the size and location of a ureteral stone for prediction of the clinical management
of North Carolina at Chapel Hill. required. In a study performed by Fielding et al (2), it was concluded that stones larger than
© RSNA, 2003 5 mm in the upper one-third of the ureter were very unlikely to pass spontaneously.
In contrast, hypothesis testing enables one to quantify the degree of uncertainty in sam-
pling variation, which may account for the results that deviate from the hypothesized values
in a particular study (3,4). For example, hypothesis testing would be necessary to determine if
ovarian cancer is more prevalent in nulliparous women than in multiparous women.
It is important to distinguish between a research hypothesis and a statistical hypothesis.
The research hypothesis is a general idea about the nature of the clinical question in the
population of interest. The primary purpose of the statistical hypothesis is to establish the
basis for tests of significance. Consequently, there is also a difference between a clinical
conclusion based on a clinical hypothesis and a statistical conclusion of significance based
on a statistical hypothesis. In this article, we will focus on statistical hypothesis testing
only.
In this article we review and demonstrate the hypothesis tests for both a single propor-
tion and a comparison of two independent proportions. The topics covered may provide
a basic understanding of the quantitative approaches for analyzing radiologic data. De-
tailed information on these concepts may be found in both introductory (5,6) and
advanced textbooks (7–9). Related links on the World Wide Web are listed in Appendix A.
609
and this condition is called the null hy-
pothesis (H0). On the basis of the P value
and whether it is low enough, one can
conclude that H0 is not true and that
there really is a difference. This act of
conclusion is in some ways a “leap of
Radiology
Note.—A statistical power of 1 ⫺  is analogous to the sensitivity of a diagnostic test. The with that in the literature, we conduct a
probability 1 ⫺ ␣ is analogous to the specificity of a diagnostic test.
statistical hypothesis test with five steps:
1. H0 is as follows: 80% of the ureteral
stones smaller than 6 mm will pass spon-
taneously ( ⫽ 0.80). H1 is as follows: The
what the probability is that our conclu- whose cancers are in remission. We also proportion of the stones smaller than 6
sion is wrong when in fact H0 is correct. study a two-sample problem in which tri- mm that pass spontaneously does not
The lower the P value, the less likely that als are conducted independently in two equal 80%—that is, it is either less than
our rejection of H0 is erroneous. By con- study groups. For example, one may or greater than 80% ( ⫽ 0.80). This is
vention, most analysts will not claim compare the sensitivities or specificities therefore a two-sided hypothesis test.
that they have found statistical signifi- of two imaging modalities. Similarly, pa- 2. The test statistic Z is calculated to be
cance if there is more than a 5% chance tients in one group receive a new treat- 1.29 on the basis of the results of the Z
of being wrong (P ⫽ .05). ment, whereas independently patients in test of a single proportion (5).
the control group receive a conventional 3. The P value, .20, is the sum of the
Type I and II Errors treatment, and the proportions of remis- two tail probabilities of a standard nor-
sion in the two patient populations are mal distribution for which the Z values
Two types of errors can occur in hy- are beyond ⫾1.29 (Figure).
compared.
pothesis testing: A type I error (signifi- 4. Because the P value, .20, is greater
When sample sizes are large, the ap-
cance level ␣) represents the probability than the significance level ␣ of 5%, H0 is
proximate normality assumptions hold
that H0 was erroneously rejected when in not rejected.
for both the sample proportion and the
fact it is true in the underlying popula- 5. Therefore, our data support the be-
test statistic. In the test of a single pro-
tion. Note that the P value is not the lief that 80% of the stones smaller than 6
portion () based on a sample of n inde-
same as the ␣ value, which represents the mm in diameter will pass spontaneously,
pendent trials at a hypothesized success
significance level in a type I error. The as reported in the literature. Thus, H0 is
probability of 0 (the hypothesized pro-
significance level ␣ is prespecified (5% not rejected, given the data at hand.
portion), both n0 and n(1 ⫺ 0) need to
conventionally), whereas the P value is Consequently, it is possible that a type II
be at least 5 (Appendix B). In the com-
computed on the basis of the data and error will occur if the true proportion in
parison of two proportions, 1 and 2,
thus reflects the strength of the rejection the population does not equal 80%.
based on two independent sample sizes
of H0 on the test statistic. A type II error
of n1 and n2 independent trials, respec-
(significance level ) represents the prob-
tively, both n1 and n2 need to be at least Two-Sample Z Test to Compare
ability that H0 was erroneously retained
30 (Appendix C) (5). The test statistic is Two Independent Proportions
when in fact H1 is true in the underlying
labeled Z, and, hence, the analysis is re-
population. There is always a trade-off Brown et al (14) hypothesized that the
ferred to as the Z test of a proportion.
between these two types of errors, and imaging appearances (eg, multilocular-
Other exact hypothesis-testing methods
such a relationship is similar to that be- ity) of primary ovarian tumors and met-
are available if these minimum numbers
tween sensitivity and specificity in the astatic tumors to the ovary might be dif-
are not met.
diagnostic literature (Table) (10). The ferent. Data were obtained from 280
Furthermore, the Z and Student t tests
probability 1 ⫺  is the statistical power patients who had an ovarian mass and
both are parametric hypothesis tests—
and is analogous to the sensitivity of a underwent US in the Radiologic Diagnostic
that is, they are based on data with an
diagnostic test, whereas the probability Oncology Group (RDOG) ovarian cancer
underlying normal distribution. There
1 ⫺ ␣ is analogous to the specificity of a staging trial (15,16). The study results
are many situations in radiology research
diagnostic test. showed that 30 (37%) of 81 primary ovar-
in which the assumptions needed to use
a parametric test do not hold. Therefore, ian cancers, as compared with three (13%)
STATISTICAL TESTS OF nonparametric tests must be considered of 24 metastatic neoplasms, were multiloc-
PROPORTIONS: THE Z TEST (9). These statistical tests will be dis- ular at US. To test if the respective under-
cussed in a future article. lying proportions are different, we conduct
We now focus on hypothesis testing for a statistical hypothesis test with five steps:
either a proportion or a comparison of 1. H0 is as follows: There is no differ-
two independent proportions. First, we TWO RADIOLOGIC EXAMPLES ence between the proportions of mul-
study a one-sample problem. In a set of tilocular metastatic tumors (1) and mul-
One-Sample Z Test of a Single
independent trials, one counts the num- tilocular primary ovarian tumors (2)
Proportion
ber of times that a certain interesting among the primary and secondary ovar-
event (eg, a successful outcome) occurs. Fielding et al (2) evaluated the unen- ian cancers—that is, 1 ⫺ 2 ⫽ 0. H1 is as
The underlying probability of success (a hanced helical CT features of 100 ureteral follows: There is a difference in these pro-
proportion) is compared against a hy- calculi, 71 of which passed spontane- portions: One is either less than or
pothesized value. This proportion can be ously and 29 of which required interven- greater than the other—that is, 1 ⫺ 2 ⫽
the diagnostic accuracy (eg, sensitivity or tion. According to data in the available 0. Thus, a two-sided hypothesis test is
specificity) or the proportion of patients literature (11–13), approximately 80% of conducted.