5 May 2006
Parasite prevalence (the proportion of infected hosts) is a common measure used to describe parasitaemias and to unravel ecological and evolutionary factors that influence host–parasite relationships. Prevalence estimates are often based on small sample sizes because of either low abundance of the hosts or logistical problems associated with their capture or laboratory analysis. Because the accuracy of prevalence estimates is lower with small sample sizes, addressing sample size has been a common problem when dealing with prevalence data. Different methods are currently being applied to overcome this statistical challenge, but far from being different correct ways of solving a same problem, some are clearly wrong, and others need improvement.
Introduction
In a given population with N individuals, the prevalence (P) denotes the proportion of infected individuals by a given parasite species or group of species (e.g. a parasite genus). The actual prevalence of a population, however, is usually unknown because the number of sampled hosts (n; the sample size) is generally lower than total population size (N). However, we can easily obtain an estimate (p) by dividing the number of infected individuals (i) by the number of sampled ones [p = (i/n)*100], where i = 0, 1, 2, …, n and n = 1, 2, 3, …, N. Nonetheless, the accuracy of this estimate is known to be affected by sample size. This problem has long concerned researchers because the results of statistical analyses dealing with prevalence data and the derived conclusions could depend greatly on the number of sampled hosts [1]. Here, we review current methods used to overcome this statistical problem, identify wrong approaches, improve others and suggest more powerful practices.
Basic concepts
The low accuracy of prevalence estimates when using a small sample size has a mathematical basis. Given that both the number of infected individuals and sample size are integers, sampling prevalence (i.e. the prevalence within a given sample of individuals) is constrained to a particular set of values for each sample size (Figure 1). Moreover, the behaviour of prevalence estimates for slight increases in the number of infected individuals changes dramatically when we move from small to large sample sizes. For instance, if we sample six adult lizards and find four to be infected (n = 6, i = 4) we obtain p = (4/6)*100 = 67%, but if we found five infected, then p = 83%, a difference of 16%. However, for n = 100, i = 4 returns p = 4%, and i = 5 returns p = 5%, only a 1% difference. This produces the effect that slight prevalence differences can be detected at high sample sizes but not at low ones. In other words, our uncertainty about the real prevalence (i.e. population prevalence, P) is higher at low sample sizes (Figure 2).
The accuracy with which we calculate the prevalence decreases not only at low sample sizes (as expected), but also when the populational prevalence is close to 50% (Figure 2). At first, this property could seem counterintuitive, because rare events (e.g. a parasitised individual at P = 1% or an uninfected individual at P = 99%) are difficult to observe, particularly at low sample sizes. However, when populational prevalence is close to zero or 100%, sample prevalence is also close to or equal to zero or 100%, respectively, precisely because rare events are difficult to detect. At intermediate populational prevalences (e.g. P = 52%), the chance of finding parasitised or non-parasitised individuals is similar, and sample prevalence could then freely fluctuate from zero to 100%, falling away from the actual populational prevalence and increasing the uncertainty in the prevalence estimate.
Minimum sample size
The most intuitive method to overcome the high statistical uncertainty of prevalences calculated from low sample sizes is to reject data obtained from small sample sizes. However, the threshold for establishing a minimum sample size is highly variable because it depends on subjective decisions of the researchers. This leads to researchers either not considering a minimum sample size (e.g. [2]), or considering low minimum sample sizes, such as three [3], five [4], or eight [5]; medium sample sizes, such as ten [6], 15 [7] or 20 [8]; and higher ones up to 30 [9] or even 75 [10]. Obviously, more is better, but rather than being a linear relationship, uncertainty rapidly decreases as sample size increases up to 10–20 individuals, but not much more with further increasing sample sizes (Figure 2). Thus, a sample size around 15 could be used as a reasonable trade-off between not losing too much data from analyses and maintaining acceptable levels of uncertainty (around 1/3
Corresponding author: Jovani, R. (jovani@ebd.csic.es).
Available online 13 March 2006
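As a minimal sketch of the arithmetic described under Basic concepts and Minimum sample size, the Python snippet below computes the sample prevalence p = (i/n)*100, lists the values p can take at a given sample size, and evaluates a standard error of the kind given in the Figure 2 caption. The function names are illustrative and do not come from the article, and the square-root form of the standard error is an assumption of this sketch.

```python
from math import sqrt


def prevalence(i, n):
    """Sample prevalence p (%) from i infected hosts among n sampled hosts."""
    if n < 1 or not 0 <= i <= n:
        raise ValueError("need n >= 1 and 0 <= i <= n")
    return 100 * i / n


def attainable_prevalences(n):
    """Every value p can take at sample size n: 0, 100/n, 2*100/n, ..., 100."""
    return [prevalence(i, n) for i in range(n + 1)]


def standard_error(p, n):
    """Standard error (%) of a prevalence p (%) at sample size n >= 2.

    Assumption of this sketch: SE = 100 * sqrt(p*q / (n - 1)),
    with p and q = 1 - p expressed as proportions.
    """
    prop = p / 100
    return 100 * sqrt(prop * (1 - prop) / (n - 1))


# The lizard example: one extra infected host moves p a lot at n = 6 ...
print(round(prevalence(4, 6)), round(prevalence(5, 6)))  # 67 83
# ... but very little at n = 100.
print(prevalence(4, 100), prevalence(5, 100))            # 4.0 5.0

# Standard error at an intermediate prevalence (50%) for several sample sizes.
for n in (5, 10, 15, 20, 50, 100):
    print(n, round(standard_error(50, n), 1))
```

Under these assumptions the standard error at p = 50% falls from 25.0 at n = 5 to 13.4 at n = 15, and only to 5.0 at n = 100, which is the flattening that the text describes for sample sizes beyond 10–20.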
www.sciencedirect.com 1471-4922/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.pt.2006.02.011
Opinion TRENDS in Parasitology Vol.22 No.5 May 2006 215
[Figure 1, panels (a)–(c). Axes: sample size (n) and sampling prevalence (p).]
prevalences from their analyses or to question its previous use. These authors stated correctly [13] that large sample sizes (e.g. 1000 hosts) are needed to detect very low prevalences (e.g. 0.1%), but that 100% prevalences could be detected with only one individual sampled if it is infected. They presented figures with both axes log-transformed (similar to Figure 1b), but this hid the symmetric shape of the actual relationship between sample size and prevalence (Figure 1a,c). That is, as happens with prevalences near zero, prevalences near 100% (e.g. 99.9%) could only be achieved with high sample sizes (e.g. 1000 hosts). Thus, according to this symmetry and the suggestions of Gregory and Blackburn [13], we
Figure 2. Standard error of prevalence estimates at different sample sizes and populational prevalences. Standard error (SE) is calculated using the formula: SE = 100*√[p*q/(n-1)], where p is the sample prevalence, q = 1-p, and n is the sample size. The white line shows results for n = 15.
area [19]. Moreover, it has been influenced by the need to control for sample size when analysing parasite richness, because the more host individuals that are examined, the more parasite species could be found [7,20].
Adding more support to this method, and using empirical log-log plots similar to Figure 1b, Gregory and Blackburn [13] suggested that a negative relationship between sample size and prevalence is expected as a mathematical artefact (and thus needed to be controlled for). In addition, they stated that nothing but negative slopes were found when simulating the effect of sample size (from 1 to 3500) on prevalence estimates when zero values were deliberately avoided. Clearly, however, there is no mathematical relationship between prevalence and sample size per se (Figure 1). We confirmed this by repeating the same simulation study done in Ref. [13]. We performed 100 simulations in which 200 hypothetical
species (or populations) took prevalence values varying between zero and 100% and sample sizes between 1 and 3500 (R.J. and J.L.T., unpublished). Among the resulting relationships, 49 were negative (Spearman rank correlations, r range = −0.0008 to −0.1714; mean = −0.0565) and 51 positive (r range = 0.0045–0.2675; mean = 0.0598), only two negative and four positive weak correlations being statistically significant. Moreover, the results were identical when zero prevalences were not allowed in the simulations. Our results differ greatly from those of Ref. [13], suggesting that the conclusions of Gregory and Blackburn were based on the visual examination of log-log plots similar to Figure 1b.
In a more recent but largely unnoticed study regarding the relationships between prevalence and sample size,
[Figure 3, panels (a) and (b). Vertical axis: Spearman r between n and p.]
(1–15), sampling prevalences and sample sizes were usually positively correlated at low populational prevalences, uncorrelated at intermediate populational prevalences, and negatively correlated at high populational prevalences (Figure 3a). At higher sample sizes (15–100), however, significant correlations were fewer and were evenly distributed (Figure 3b). This is because, at low populational prevalence, the chances of finding an infected
individual among a low sample number is low, but the chances increase with increasing sample size; the reverse happens at higher populational prevalences (Figure 3a). However, when all the species have a minimum sample size above 15, the effect of rare events on sample prevalence is buffered at any population prevalence (Figure 3b).
The problem with using the residuals from a regression between prevalence and sample size is that it artificially increases the prevalence estimates of some species and underestimates the prevalences of others. Moreover, this method could not be applied even if (by chance) a correlation exists between prevalence and sample size in a given data set. To illustrate this point, one can deliberately create a statistical relationship between sample size and populational prevalence by sampling more individuals from populations with a higher populational prevalence (Figure 4). The simulated sample size and the resulting sample prevalence correlate here, of course, because they have been made to do so deliberately (Figure 4a). There is also the expected correlation between the estimated and real prevalences in the initial data (Figure 4b). However, the correlation between real prevalences and calculated residuals is null. This means that the residuals are unrelated to populational prevalences! Residuals thus become statistical artefacts that cannot be used as estimators of prevalence for comparative purposes, and thus previous results obtained through this method should be taken with caution.
Finally, it is worth noting that some relevant biological factors could be shaping a prevalence–sample-size relationship. For instance, in a recent study, Ricklefs et al. [15] used the number of individuals trapped (the sample size) as an index of bird species abundance in a given study area. In this way, they used the relationship between prevalence and sample size to assess the potential relationship between host density and parasite prevalence. Thus, the prevalence–sample-size relationship should be seen as a potentially interesting pattern by itself, rather than a statistical artefact that should be controlled for.
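The two simulation arguments in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, populational prevalences are drawn uniformly between zero and 100%, sample sizes are capped at 100 rather than 3500 to keep a pure-Python run fast, and in the residuals experiment the sample size is set exactly equal to the populational prevalence.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in order[start:end + 1]:
            out[k] = (start + end) / 2 + 1
        start = end + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sample_prevalence(n, P, rng):
    """Draw n hosts from a population with prevalence P (%); return p (%)."""
    infected = sum(rng.random() < P / 100 for _ in range(n))
    return 100 * infected / n


def null_correlations(n_sims=100, n_species=200, max_n=100, seed=1):
    """Spearman r between n and p when n and P are drawn independently."""
    rng = random.Random(seed)
    result = []
    for _ in range(n_sims):
        P = [rng.uniform(0, 100) for _ in range(n_species)]
        n = [rng.randint(1, max_n) for _ in range(n_species)]
        p = [sample_prevalence(k, q, rng) for k, q in zip(n, P)]
        result.append(spearman(n, p))
    return result


def residual_check(seed=1):
    """Make n track P, regress p on n, then correlate p and the residuals with P."""
    rng = random.Random(seed)
    P = list(range(5, 96))  # true prevalences 5%, 6%, ..., 95%
    n = list(P)             # assumption of this sketch: sample size equals P
    p = [sample_prevalence(k, q, rng) for k, q in zip(n, P)]
    mn, mp = mean(n), mean(p)
    slope = sum((a - mn) * (b - mp) for a, b in zip(n, p)) / sum((a - mn) ** 2 for a in n)
    residuals = [b - (mp + slope * (a - mn)) for a, b in zip(n, p)]
    return pearson(p, P), pearson(residuals, P)


rs = null_correlations()
print("negative:", sum(r < 0 for r in rs), "positive:", sum(r > 0 for r in rs))
print("mean r: %.3f" % mean(rs))
r_raw, r_resid = residual_check()
print("r(p, P) = %.2f, r(residuals, P) = %.2f" % (r_raw, r_resid))
```

Because sample size and populational prevalence are independent in the first experiment, the coefficients should scatter around zero with no preferred sign. In the second experiment the sample size is an exact linear function of P, so the least-squares residuals are orthogonal to P by construction; this is the simplest version of the point that residuals carry no information about the real prevalences, and a noisier link between n and P would only approximate it.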
[Figure 4. Axis labels: sampling prevalence (p); residuals n–p.]
Concluding remarks
Current practices for the analysis of prevalence data must be revised. Statistical tools that take into account the sample size from which each proportion has been obtained [23], that weight for sample size [8], or use individual infection status (infected or not) as the dependent variable [16] are increasingly being used. In this way, information
minimum sample sizes.
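To make the idea of weighting for sample size concrete, the short Python sketch below compares an unweighted mean of sample prevalences with a sample-size-weighted mean, and shows that the weighted mean equals the prevalence obtained by pooling individual hosts. It is a simple illustration only and not the method of any of the cited studies; the function names and the three (infected, sampled) pairs are hypothetical.

```python
def prevalence(i, n):
    """Sample prevalence (%) from i infected hosts among n sampled hosts."""
    return 100 * i / n


def unweighted_mean(samples):
    """Mean prevalence giving every sample the same weight."""
    return sum(prevalence(i, n) for i, n in samples) / len(samples)


def weighted_mean(samples):
    """Mean prevalence with each sample weighted by its sample size n."""
    total_n = sum(n for _, n in samples)
    return sum(n * prevalence(i, n) for i, n in samples) / total_n


def pooled(samples):
    """Prevalence computed from individual hosts pooled across samples."""
    return 100 * sum(i for i, _ in samples) / sum(n for _, n in samples)


# Hypothetical data: (infected, sampled) for three samples of very different size.
samples = [(1, 1), (2, 10), (30, 100)]
print(unweighted_mean(samples))                                     # 50.0
print(round(weighted_mean(samples), 1), round(pooled(samples), 1))  # 29.7 29.7
```

In this made-up example the single host sampled in the first population (p = 100%) dominates the unweighted mean, whereas weighting by n lets the better-sampled populations carry more of the estimate.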
References
1 Read, A.F. and Harvey, P.H. (1989) Reassessment of comparative evidence for Hamilton and Zuk theory on the evolution of secondary
8 Scheuerlein, A. and Ricklefs, R.E. (2004) Prevalence of blood parasites in European passeriform birds. Proc. Biol. Sci. 271, 1363–1370
9 Arneberg, P. et al. (1998) Host densities as determinants of abundance in parasite communities. Proc. R. Soc. Lond. B. Biol. Sci. 265, 1283–1289
10 Poulin, R. and Mouritsen, K.N. (2003) Large-scale determinants of trematode infections in intertidal gastropods. Mar. Ecol. Prog. Ser. 254, 187–198
11 John, J. (1995) Parasites and the avian spleen: helminths. Biol. J. Linn. Soc 54, 87–106
12 Pruett-Jones, S.G. et al. (1990) Parasites and sexual selection in birds of Paradise. Am. Zool. 30, 287–298
13 Gregory, R.D. and Blackburn, T.M. (1991) Parasite prevalence and host sample size. Parasitol. Today 7, 316–318
14 Sol, D. et al. (2000) Geographical variation in blood parasites in feral pigeons: the role of vectors. Ecography 23, 307–314
15 Ricklefs, R.E. et al. (2005) Community relationships of avian malaria parasites in southern Missouri. Ecol. Monogr. 75, 543–559
16 Mendes, L. et al. (2005) Disease limited distributions? Contrasts in the prevalence of avian malaria in shorebird species using marine and freshwater habitats. Oikos 109, 396–404
17 Harvey, P.H. and Pagel, M.D., eds (1991) The Comparative Method in Evolutionary Biology, Oxford University Press
18 Poulin, R. and Valtonen, E.T. (2001) Nested assemblages resulting from host size variation: the case of endoparasite communities in fish hosts. Int. J. Parasitol. 31, 1194–1204
19 Garland, T., Jr. et al. (1992) Procedures for the analysis of comparative data using phylogenetically independent contrasts. Syst. Biol. 41, 18–32
20 Walther, B.A. and Morand, S. (1998) Comparative performance of species richness estimation methods. Parasitology 116, 395–405
21 Gregory, R.D. and Woolhouse, M.E.J. (1993) Quantification of parasite aggregation: a simulation study. Acta Trop. 54, 131–139
22 Pruett-Jones, M. and Pruett-Jones, S. (1991) Analysis and ecological correlates of tick burdens in a New Guinea avifauna. In Bird-Parasite Interactions (Loye, J.E. and Zuk, M., eds), pp. 155–176, Oxford University Press
23 Tella, J.L. et al. (1999) Habitat, world geographic range, and embryonic development of host explain the prevalence of avian hematozoa at small spatial and phylogenetic scales. Proc. Natl. Acad. Sci. U. S. A. 96, 1785–1789
24 Hedges, L.V. and Olkin, I. (1985) Statistical Methods for Meta-Analysis, Academic Press
25 Paterson, S. and Lello, J. (2003) Mixed models: getting the best use of parasitological data. Trends Parasitol. 19, 370–375
26 Bennett, G.F. et al. (1982) Host–Parasite Catalogue of the Avian Haematozoa, Occasional Papers in Biology. Memorial University of Newfoundland
27 Bishop, M.A. and Bennett, G.F. (1992) Host-Parasite Catalogue of the Avian Haematozoa (Suppl. 1), Occasional Papers in Biology. Memorial University of Newfoundland
28 Hamilton, W.D. and Zuk, M. (1982) Heritable true fitness and bright birds: a role for parasites? Science 218, 384–387
29 Cooper, J.E. and Anwar, M.A. (2001) Blood parasites of birds: a plea for more cautious terminology. Ibis 143, 149–150