Journal of Applied Ecology 1998, 35, 523-531

A taxonomic distinctness index and its statistical properties

K.R. CLARKE and R.M. WARWICK
Centre for Coastal and Marine Sciences, Plymouth Marine Laboratory, Prospect Place, West Hoe, Plymouth PL1 3DH, UK

Summary
Introduction

It is increasingly recognized (e.g. Harper & Hawksworth 1994) that adequate measures of biodiversity within a particular taxonomic group should not be merely functions of the number of species present and their relative abundances, but should also include information on the 'relatedness' of these species. There is now a substantial literature (Faith 1994; Humphries, Williams & Vane-Wright 1995 and references therein) on measures incorporating, principally, phylogenetic relationships amongst species and their possible use in selecting species or reserves of greatest conservation priority. Vane-Wright, Humphries & Williams (1991), Williams, Humphries & Vane-Wright (1991) and May (1990) introduced measures of distinctness based only on the topology of a phylogenetic tree, appropriate when branch lengths are entirely unknown, and Faith (1992, 1994) defined and justified a phylogenetic diversity (PD) measure based on known branch lengths: PD is simply the cumulative branch length of the full tree.

This literature does not appear, to date, to have carried over into the area of environmental monitoring and assessment, where the emphasis is not on choosing species to conserve but monitoring for environmental degradation or the benefits of remediation. The considerations here are rather different: the raw material is often a set of community samples with recorded abundances for each observed species, rather than a single species list, thought of as a complete inventory. The outcome required is not a preferential selection of species from the inventory for conservation status, but an assessment of whether sampled assemblages display some pattern in biodiversity through time or in space. Natural variation, and thus sampling properties of the resulting abundance matrices and derived indices, are of paramount importance. Also, the basic information on species relatedness is often just a Linnean taxonomy (Fig. 1), a crude approximation to a phylogeny but one that does impose an ordering of branch lengths which is interpretable and should be used. For example, even allowing that the only data available are in the form of presence/absences, measures that rely solely on topology and/or species richness would not distinguish between Fig. 2a and Fig. 2b, yet Fig. 2a clearly exhibits greater biodiversity in the sense of richness in higher taxa. Similarly, PD, applied to a Linnean classification (Faith 1994), has a focus exclusively on 'character richness' rather than 'character combinations' (in the terminology of Humphries, Williams & Vane-Wright 1995), so that PD concentrates on higher taxon richness and ignores the evenness component in diversity. Thus, PD would not distinguish between Fig. 2c and Fig. 2d, yet Fig. 2d clearly represents a less taxonomically diverse assemblage than Fig. 2c, both in the sense of possessing greater vulnerability to species loss and in potential functional inefficiency.

An over-riding consideration in a comparative biodiversity study is the extent to which a putative statistic is sensitive to differing sampling effort at different sites or times. It is well-known, and demonstrated starkly in Fig. 3a-c, that standard diversity estimates can be strongly dependent on sampling effort, particularly in so far as they are influenced by the number of species in the sample. Species richness is crucially dependent on sampling effort and it must be expected that only carefully controlled and equitable sampling studies can provide comparative data. Warwick & Clarke (1995), however, define taxonomic diversity/distinctness measures that satisfy the above requirements of incorporating higher taxon richness and evenness concepts (see the Δ+ values in the legend to Fig. 2) but also have an apparent insensitivity to sampling effort (Fig. 3d-f).

In the Methods that follow, the construction of Δ and Δ* is described and the link to the Simpson diversity drawn. It is demonstrated theoretically that, if Δ_m and Δ_m* are defined as the values of Δ and Δ* from a subset of m organisms, randomly selected from a total of n individuals, then they are either exactly (Δ_m) or
[Fig. 1 graphic: a hierarchical tree with levels Families, Genera, Species and Individuals.]

Fig. 1. Part of a taxonomic classification, showing examples of path length weights {w_ij} used to define taxonomic diversity/distinctness measures; conventional diversity indices utilize only the species abundances {x_i; i = 1, ..., s}.

© 1998 British Ecological Society, Journal of Applied Ecology, 35, 523-531
[Fig. 2 graphic: four tree diagrams, panels (a)-(d), drawn over the taxonomic levels Order, Family, Genus and Species.]

Fig. 2. Some simple, contrasting taxonomic trees for presence/absence data (i.e. ignoring species abundance information). Diversity measures based only on topology of the trees would not distinguish (a) from (b), and measures based on total branch length would not distinguish (c) from (d), but taxonomic distinctness Δ+, based on the average of pair-wise path lengths (equation 4 of the text), does draw these distinctions. Using a simple (1, 2, 3, ...) weighting of path lengths, Δ+ values are (a) 3·0, (b) 1·0, (c) 1·56 and (d) 1·2, placing the four configurations in the intuitively expected distinctness order.
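
The average pair-wise path length described in this caption can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' software: each species is represented as a tuple of taxon names from the highest rank down to the species, the weighting is the simple 1, 2, 3, ... scheme, and the three assemblages and the function names `path_weight` and `delta_plus` are hypothetical (they are not the trees drawn in Fig. 2).

```python
from itertools import combinations


def path_weight(a, b):
    """Path-length weight between two species given as (order, family, genus, species).

    Simple 1, 2, 3, ... weighting: 1 = same genus, 2 = same family but
    different genera, 3 = same order but different families, and so on.
    """
    shared = 0
    for taxon_a, taxon_b in zip(a, b):
        if taxon_a != taxon_b:
            break
        shared += 1
    return len(a) - shared


def delta_plus(species):
    """Average path length over all pairs of distinct species (presence/absence data)."""
    pairs = list(combinations(species, 2))
    return sum(path_weight(a, b) for a, b in pairs) / len(pairs)


# Hypothetical assemblages (assumptions for illustration only).
separate_families = [("O1", f"F{i}", f"G{i}", f"sp{i}") for i in range(4)]
single_genus = [("O1", "F1", "G1", f"sp{i}") for i in range(4)]
mixed = [
    ("O1", "F1", "G1", "sp1"),
    ("O1", "F1", "G1", "sp2"),
    ("O1", "F1", "G2", "sp3"),
    ("O1", "F2", "G3", "sp4"),
]

for name, assemblage in [("separate families", separate_families),
                         ("single genus", single_genus),
                         ("mixed", mixed)]:
    print(name, round(delta_plus(assemblage), 2))
```

Four species in four different families of one order give 3·0 and four congeneric species give 1·0, matching the extremes quoted for panels (a) and (b); the mixed assemblage falls in between, even though all three have the same species richness.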
[Fig. 3 graphic: multi-panel plot, panels (a)-(f) as cited in the text; the plotted values and the caption are not recoverable from the extracted text.]
approximately (Δ_m*) unbiased estimates of the respective true Δ and Δ* for the whole sample, whatever the subset size m. The unbiasedness is also shown to be exact in the particular case where the data only records the presence or absence of species, not their abundances.
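
The unbiasedness property just stated can be checked numerically. The Python sketch below is a minimal illustration, not the authors' code. It takes Δ to be the mean path length over all pairs of individuals (pairs from the same species counting zero) and Δ* to be the mean over pairs of individuals from different species, which are the definitions given as equations 1 and 3 of the Methods, and then averages Δ_m over repeated random subsamples of m individuals drawn without replacement. The abundances `x`, the weight matrix `w`, the subsample size and all function names are hypothetical.

```python
import random
from itertools import combinations


def taxonomic_diversity(x, w):
    """Delta: mean path length over all pairs of individuals (equation 1)."""
    n = sum(x)
    num = sum(w[i][j] * x[i] * x[j] for i, j in combinations(range(len(x)), 2))
    return num / (n * (n - 1) / 2)


def taxonomic_distinctness(x, w):
    """Delta*: mean path length over pairs of individuals from different species (equation 3)."""
    pairs = list(combinations(range(len(x)), 2))
    num = sum(w[i][j] * x[i] * x[j] for i, j in pairs)
    den = sum(x[i] * x[j] for i, j in pairs)
    return num / den


def subsample(x, m, rng):
    """Draw m individuals without replacement and return the counts per species."""
    individuals = [i for i, count in enumerate(x) for _ in range(count)]
    counts = [0] * len(x)
    for i in rng.sample(individuals, m):
        counts[i] += 1
    return counts


# Hypothetical abundances and path-length weights for six species (assumptions).
x = [50, 30, 10, 5, 3, 2]
w = [
    [0, 1, 2, 3, 3, 3],
    [1, 0, 2, 3, 3, 3],
    [2, 2, 0, 3, 3, 3],
    [3, 3, 3, 0, 1, 3],
    [3, 3, 3, 1, 0, 3],
    [3, 3, 3, 3, 3, 0],
]

rng = random.Random(1)  # fixed seed so the run is repeatable
runs = [taxonomic_diversity(subsample(x, 20, rng), w) for _ in range(2000)]
mean_delta_m = sum(runs) / len(runs)

print("Delta for the full sample:", round(taxonomic_diversity(x, w), 3))
print("mean Delta_m over 2000 subsamples of 20:", round(mean_delta_m, 3))
print("Delta* for the full sample:", round(taxonomic_distinctness(x, w), 3))
```

The two printed Δ values should agree up to simulation error, whereas the number of species found in a subsample of 20 individuals would typically be smaller than in the full sample of 100.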
Methods

DEFINITION OF INDICES

'Taxonomic diversity' (Warwick & Clarke 1995) is defined, using the notation of Fig. 1, as:

\[
\Delta = \frac{\sum\sum_{i<j} w_{ij} x_i x_j + \sum_i 0 \cdot x_i (x_i - 1)/2}{\sum\sum_{i<j} x_i x_j + \sum_i x_i (x_i - 1)/2}
\equiv \frac{\sum\sum_{i<j} w_{ij} x_i x_j}{n(n - 1)/2} \qquad \text{eqn 1}
\]

where x_i (i = 1, ..., s) denotes the abundance of the ith species, n (= Σ_i x_i) is the total number of individuals in the sample and w_ij is the 'distinctness weight' given to the path length linking species i and j in the hierarchical classification. The double summations are over all pairs of species i and j (with i < j, for sake of definiteness). The first form of equation 1 exemplifies the construction of Δ as a pair-wise, averaged (weighted) path length in the diagram; the (null) second term in the numerator is included here to emphasize the zero path length defined for two individuals of the same species. In more formal statistical terms, Δ is the expected path length between any two randomly chosen individuals from the sample. The second, more succinct, form of equation 1 shows the relationship of Δ to standard diversity indices: when w_ij ≡ 1 (for all i < j), i.e. when the taxonomic hierarchy is ignored, Δ reduces to a form of Simpson diversity (e.g. Pielou 1975), namely:

\[
\Delta = \frac{2 \sum\sum_{i<j} p_i p_j}{1 - n^{-1}}, \quad \text{where } p_i = x_i / n,
\]
\[
\phantom{\Delta} = \frac{(\sum_i p_i)^2 - \sum_i p_i^2}{1 - n^{-1}}
= \frac{1 - \sum_i p_i^2}{1 - n^{-1}}. \qquad \text{eqn 2}
\]

Indeed, the Simpson index was first constructed from the probability that any two organisms, selected at random from the full set of individuals, are from the same species (Simpson 1949). Taxonomic diversity Δ can therefore be seen as a generalization of Simpson diversity, incorporating an element of taxonomic relatedness.

This motivates the introduction of a second index, 'taxonomic distinctness', Δ* (Warwick & Clarke 1995), which is modified to remove some of the overt dependence of Δ on the species abundance distribution represented by the {x_i}. It divides the taxonomic diversity of equation 1 by the Simpson-type index of equation 2, i.e. Δ is divided by its value when the hierarchical classification collapses to the special case of all species belonging to a single genus. By definition, the resulting ratio, Δ*, must be more nearly a function of pure taxonomic relatedness of individuals. The algebraic definition of taxonomic distinctness is:

\[
\Delta^{*} = \frac{\sum\sum_{i<j} w_{ij} x_i x_j}{\sum\sum_{i<j} x_i x_j} \qquad \text{eqn 3}
\]

and an alternative way of viewing this is as the expected (weighted) path length between any two randomly chosen individuals from the sample, conditional on them being from different species. Note that, unlike Δ, the expression for Δ* is invariant to a scale change in x, so that Δ* could incorporate straightforwardly cases where the data are not counts of individuals but, say, total biomass for each species. It would also accommodate the use of transformed counts, e.g. the log(1 + x) or x^0·25 transformations commonly used in multivariate community analysis to down-weight the contributions of dominant species (Clarke & Green 1988). A special case (in a sense, the ultimate down-weighting transformation) is the use only of presence/absence information for each species. The {x_i} are then all thought of as equating to unity (for species that are present) and Δ and Δ* reduce to the same statistic, namely:

\[
\Delta^{+} = \frac{\sum\sum_{i<j} w_{ij}}{s(s - 1)/2} \qquad \text{eqn 4}
\]

where s is the number of species present and, for the double summation, i and j range over these s species.

The statistical results of the paper concentrate only on two cases: when the {x_i} are (untransformed) counts and when only presence/absence information is available. They encompass any weighting scheme for the {w_ij}. The examples in this and the companion paper (Warwick & Clarke 1998), however, all use the simplest possible weights, in the context of the class of free-living marine nematodes: w = 1 (species in the same genus), 2 (same family but different genera), 3 (same suborder but different families), 4 (same order but different suborders), 5 (same subclass but different orders), 6 (different subclasses). For example, treating the subtree in Fig. 1 as a complete sample, a simple (0, 1, 2) weighting of the levels gives, from equations 1, 3 and 4, the values Δ = 1·48, Δ* = 1·82 and Δ+ = 1·80. Note, however, that any other set of increasing weights would honour the structure implied by a taxonomic classification. A natural refinement would be for the weights to depend on the quantitative reduction in taxon richness on moving up the hierarchy, although the number of species per genus, genera per family, etc., would need to be set globally for each faunal group, in some way, for the index to be comparable across studies.

For all three indices, the effect of the denominator terms in equations 1, 3 and 4 is to reduce or eliminate direct dependence on the number of species s: thus in Fig. 3d-f there is an apparently static mean value for Δ, Δ* and Δ+, whatever the subsample size (or, in the case of Fig. 3f, whatever the number of species in the subset). Naturally, the variance increases with decreasing information in all cases. The Δ statistics cannot therefore claim to represent all aspects of a community's diversity: if the results of Fig. 3 have some theoretical generality then they can be seen to partition out from species richness some combination of higher taxonomic spread and evenness, making the claim that average distinctness in a sample can be
reliably estimated, while total distinctness in the sample (which clearly depends on the richness) cannot.

MEANS AND VARIANCES

For the model underlying Fig. 3d,e, the full set of species abundances {x_i; i = 1, ..., s} and the total species richness s are thought of as fixed, and the 'true' taxonomic diversity and distinctness values are given by equations 1 and 3, with the {w_ij} being known weights. A random subsample (without replacement) of a fixed number of organisms, m, is taken from the full set of individuals n (= Σ_i x_i), and Δ_m and Δ_m* denote the calculated taxonomic diversity and distinctness values from that subsample. The Appendix (case 1) shows quite generally that Δ_m is an unbiased estimate of Δ, and Δ_m* is approximately unbiased for Δ* (using a Taylor series), whatever the subsample size m and the structure of the trees.

An important special case is when only presence/absence information is available, and the subsamples now draw species at random (without replacement): m species from the full set of s. The distinctness for the subsample, Δ_m+, is again an (exactly) unbiased estimate of Δ+ for the full species set, for any m (see the Appendix, case 2). This firms up the suggestion from Fig. 3f that the mean of a number of repeated subsamples at each size is constant, and there is no subsampling bias.

The Appendix develops these results for mean values formally, in order to set the symbolism for derivation of variance formulae, but note that there is a simpler heuristic explanation for the exact unbiasedness of Δ_m and Δ_m+. For example, the expectation of Δ_m+ is just the expected path length between two randomly selected species from a subset of m species. However, the latter subset is selected randomly from the full set of s species, and a random pair from a random sample of m species (m > 1) is also a random pair from the full set of s species. By definition, Δ+ is the expected path length for a randomly selected pair of species from the full set of s species, so it must follow that E(Δ_m+) = Δ+. Similar reasoning yields the exact unbiasedness result for Δ_m but not for Δ_m*, because of the conditionality clause in its definition; recourse needs to be made to the Taylor series approximation of the Appendix.

The Appendix then goes on to show that the variance of the subsample estimate Δ_m+ has the following form:

\[
\mathrm{var}(\Delta_m^{+}) = 2(s - m)\,[m(m - 1)(s - 2)(s - 3)]^{-1}\,
[(s - m - 1)\,\sigma_{w}^{2} + 2(s - 1)(m - 2)\,\sigma_{\bar{w}}^{2}] \qquad \text{eqn 5}
\]

where:

\[
\sigma_{w}^{2} = \Big[\Big(\sum_i \sum_{j \ne i} w_{ij}^{2}\Big) / s(s - 1)\Big] - \bar{w}^{2} \qquad \text{eqn 6}
\]
\[
\sigma_{\bar{w}}^{2} = \Big[\Big(\sum_i \bar{w}_i^{2}\Big) / s\Big] - \bar{w}^{2} \qquad \text{eqn 7}
\]
\[
\bar{w}_i = \Big(\sum_{j \ne i} w_{ij}\Big) / (s - 1) \qquad \text{eqn 8}
\]
\[
\bar{w} = \Big(\sum_i \bar{w}_i\Big) / s = \Big(\sum_i \sum_{j \ne i} w_{ij}\Big) / [s(s - 1)] \equiv \Delta^{+}. \qquad \text{eqn 9}
\]

These two σ² terms are straightforward properties of the taxonomic tree for the full species set, with σ_w² corresponding to the variance of all path lengths {w_ij} between different species, and σ_w̄² the variance of the mean path lengths {w̄_i} from each species to all others. Note that equation 5 is an exact result, not a Taylor series approximation.

These sampling properties now motivate a statistical test for increase or decrease in observed taxonomic distinctness, based either on direct simulation or approximate confidence intervals (of the usual mean ± 2 SD form), constructed from the variance expression of equation 5.

Results

A PRACTICAL TEST FOR CHANGE IN TAXONOMIC DISTINCTNESS

The fact that, for presence/absence data, the distinctness estimate (Δ_m+) from a subset of m species unbiasedly estimates the distinctness (Δ+) of the full set, suggests the following test scenario for situations in which, at first sight, no valid diversity comparisons seem possible. The starting assumption is that there exists a reasonably comprehensive species list (inventory) for a region, within which certain localities are postulated to have reduced diversity. If the only data available at these localities are local species lists from one-off studies, and there is no control of the sampling effort expended in each location (or in constructing the regional inventory), then the only conventional diversity measure calculable - the number of species found at each locality - is uninterpretable. However, the above results show that one can unbiasedly compare taxonomic distinctness at a locality with that for the global list. For the null hypothesis of no difference, a randomization test can be performed by repeatedly subsampling species sets of size m, drawn at random from the global list, and constructing the histogram of the resulting Δ_m+ estimates. These will centre around the global distinctness of Δ+ and the spread of the simulated values can be used to determine if the observed Δ_m+ for that locality is at variance with the null hypothesis.

Figure 4 is based on a UK species list for free-living marine nematodes (s = 395; see the companion paper Warwick & Clarke 1998), a nematode species list (m = 122) from combined core samples taken over the course of a year at eight sandy sites in the Exe estuary, England, UK (Warwick 1971), and a further nematode species list (m = 111) from six sandy sites in the Clyde estuary, Scotland, UK (Lambshead 1986). For a total of 1000 random samples of size m = 122 (for Fig. 4a), and a further 1000 random samples with m = 111 (for Fig. 4b), drawn from the global list, the Δ_m+ estimates give the histograms of Fig. 4a,b, showing the typically rather narrow range of distinctness values commensurate with the null hypothesis for
[Fig. 4 graphic: two histograms of frequency against taxonomic distinctness Δ+ (horizontal axis 4·4 to 4·9): panel (a) Exe sands, with Δ+ = 4·75 marked, and panel (b) Clyde sands.]

Fig. 4. Histogram of Δ+ values for 1000 random subsamples of a fixed number m of species, from a full list of free-living marine nematodes of the UK (s = 395 species): (a) m = 122, (b) m = 111, corresponding to the sublist sizes for combined samples at intertidal sandy sites in the Exe and Clyde estuaries, respectively. The true Δ+ values for both localities are also indicated: for the Clyde, the null hypothesis that the average distinctness equates with that for the UK as a whole is clearly rejected (P < 0·1%).
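
The randomization test summarized in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the master list is a synthetic, perfectly balanced taxonomy of 81 species rather than the UK nematode list, the locality is a hypothetical list of 12 species drawn from a single order, the weights follow the simple 1, 2, 3, ... scheme, and the names `path_weight`, `delta_plus` and `randomization_test` are invented for the example.

```python
import random
from itertools import combinations


def path_weight(a, b):
    """Steps up the hierarchy until species a and b share a taxon (1 = same genus, ...)."""
    shared = 0
    for taxon_a, taxon_b in zip(a, b):
        if taxon_a != taxon_b:
            break
        shared += 1
    return len(a) - shared


def delta_plus(species):
    """Average path length over all pairs of distinct species."""
    pairs = list(combinations(species, 2))
    return sum(path_weight(a, b) for a, b in pairs) / len(pairs)


def randomization_test(master, observed, n_sim=1000, seed=0):
    """Compare the observed Delta+ with Delta+ for random sublists of the same size."""
    rng = random.Random(seed)
    m = len(observed)
    obs = delta_plus(observed)
    sims = sorted(delta_plus(rng.sample(master, m)) for _ in range(n_sim))
    k = max(1, round(0.025 * n_sim))          # 25 when n_sim = 1000
    lower, upper = sims[k - 1], sims[-k]      # 25th lowest and 25th highest values
    return {"observed": obs, "lower": lower, "upper": upper,
            "reject": obs < lower or obs > upper, "simulated": sims}


# Synthetic master list (assumption): 3 orders x 3 families x 3 genera x 3 species.
master = [(f"O{o}", f"F{f}", f"G{g}", f"S{k}")
          for o in range(3) for f in range(3) for g in range(3) for k in range(3)]
# Hypothetical locality: 12 species, all from the first order.
locality = master[:12]

result = randomization_test(master, locality)
print("master Delta+:", round(delta_plus(master), 3))
print("observed Delta+:", round(result["observed"], 3))
print("simulated 95% limits:", round(result["lower"], 3), round(result["upper"], 3))
print("reject null hypothesis:", result["reject"])
```

Because the hypothetical locality contains no pair of species from different orders, its Δ+ cannot exceed 3, well below the master-list value of 3·55, so the test rejects the null hypothesis; a sublist drawn at random from the master list would usually fall inside the simulated limits.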
these subsample sizes. The true Δ_m+ for the Exe estuary sands, of 4·75, lies centrally to the distribution of Fig. 4a and therefore provides no evidence of a different average distinctness at this locality than in the UK region as a whole. To reject the null hypothesis, at approximately the 5% level, the true Δ_m+ would need to fall below the 25th lowest (of 1000) simulated Δ_m+ values in the histogram, or above the 25th highest. In contrast, the true Δ_m+ for the Clyde sands (4·46) is below this lower limit in Fig. 4b and in fact it is smaller than any of the 1000 simulated values, so there is significant evidence of a lower taxonomic distinctness here than for the UK as a whole (P < 0·1%).

The computational burden of this large number of simulations, which needs to be repeated for every locality under test (with a different species subset size), can be heavy, although not usually prohibitive. A much faster, approximate procedure is provided by the variance formula of equation 5. The constants σ_w² and σ_w̄² in this expression are a function only of the tree structure of the global list (of s species) and need to be calculated only once (for all marine nematodes of the UK, for example). The variance expression is then a rather simple function of subsample size m and these constants, so that an approximate 95% confidence 'funnel' (mean ± 2 SD) can easily be constructed over the full range of m-values. Here the mean is equal to Δ+ for the global list (= 4·72 for UK marine nematodes) and the SD is the square root of the variance expression in equation 5. Figure 5 displays this funnel (the smooth, darker lines) and contrasts it with the results of extensive simulation runs (the circles, joined by lighter lines) for subset lists of m = 10, 15, 20, 25, ..., 350 species. At each point there are 1000 random selections and the circles denote the 25th lowest and 25th highest distinctness values (simulated 95% confidence limits). There is clear evidence of a left-skewed distribution for Δ_m+ in this case (as also for the Chilean nematode data of the companion paper; Warwick & Clarke 1998) but the normality approximation to the lower confidence limit (the important limit in practice) is good enough to suggest that this may be a useful short-cut to the full randomization procedure, in non-borderline cases, when computing power is limiting. An improved empirical fit could doubtless be constructed from an expression for the third moment of Δ_m+.

Discussion

As shown in Fig. 5, distinctness values for any specific locality, habitat type, pollution condition, etc., can be plotted on the confidence funnel created from a regional species list, to test for significant departures from the null hypothesis (that a particular subsample behaves, in terms of its pair-wise average distinctness, as if it were a random sample from the larger list). The companion paper, Warwick & Clarke (1998), applies and interprets this method in a range of situations.

It is perhaps surprising that a diversity test of any sort should be possible in a case where sampling effort is uncontrolled and the only data consist of presence or absence of species. Indeed, the test could not be expected to have the same sensitivity as that obtainable from a wider range of diversity measures (or multivariate analysis) calculable from abundance data in carefully standardized sampling plans. The key point to recognize here is that certain diversity features, most obviously the number of species recorded in a sample, are highly dependent on the sampling regime, and can only be straightforwardly compared under conditions of comparable sampling effort. The same caveats will apply to other diversity totals, such as PD, the total phylogenetic or taxonomic branch length in a subtree for a particular
[Fig. 5 graphic: panel titled 'UK nematodes', with vertical axis running from 4·0 to 5·4 and horizontal axis 'Subset size (m)' from 0 to 400; points labelled 'Exe sands' and 'Clyde sands'; legend: simulated mean (true mean is 4·72), simulated 95% confidence limits, theoretical 95% confidence limits.]

Fig. 5. Confidence funnels for the Δ+ randomization test, from the all-UK list of marine nematode species. Circles correspond to direct randomization results for each sublist size, and smooth (thick) lines to approximate limits using the variance formula of equation 5. The dashed line gives the mean Δ+ over each simulation, confirming the theoretical unbiasedness result (Δ+ = 4·72 for the full set of 395 species).
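
The approximate limits plotted in this figure follow from the variance expression of equation 5, which can be sketched in Python as below. This is a minimal illustration, not the authors' code: the six-species weight matrix `w` is hypothetical, the function names are invented for the example, and the variance formula is coded as var(Δ_m+) = 2(s - m)[m(m - 1)(s - 2)(s - 3)]^-1 [(s - m - 1)σ_w² + 2(s - 1)(m - 2)σ_w̄²], with the two σ² constants taken as the variance of all path lengths between different species and the variance of the per-species mean path lengths. Since the result is stated to be exact, the sketch also enumerates every sublist of a given size as a cross-check, which is only feasible for very small s.

```python
from itertools import combinations
from math import sqrt


def tree_constants(w):
    """Return (Delta+ of the full list, sigma_w^2, sigma_wbar^2) for a symmetric weight matrix."""
    s = len(w)
    wbar_i = [sum(w[i][j] for j in range(s) if j != i) / (s - 1) for i in range(s)]
    wbar = sum(wbar_i) / s
    var_w = sum(w[i][j] ** 2 for i in range(s) for j in range(s) if j != i) / (s * (s - 1)) - wbar ** 2
    var_wbar = sum(v ** 2 for v in wbar_i) / s - wbar ** 2
    return wbar, var_w, var_wbar


def var_delta_plus(m, s, var_w, var_wbar):
    """Variance of Delta_m+ for a random sublist of m species out of s (requires s >= 4, m >= 2)."""
    factor = 2 * (s - m) / (m * (m - 1) * (s - 2) * (s - 3))
    return factor * ((s - m - 1) * var_w + 2 * (s - 1) * (m - 2) * var_wbar)


def funnel(w, sizes):
    """Approximate 95% limits (mean - 2 SD, mean + 2 SD) for each sublist size."""
    wbar, var_w, var_wbar = tree_constants(w)
    s = len(w)
    limits = {}
    for m in sizes:
        sd = sqrt(max(0.0, var_delta_plus(m, s, var_w, var_wbar)))
        limits[m] = (wbar - 2 * sd, wbar + 2 * sd)
    return limits


def exhaustive_variance(w, m):
    """Cross-check: variance of Delta_m+ over every possible sublist of size m."""
    values = []
    for subset in combinations(range(len(w)), m):
        pairs = list(combinations(subset, 2))
        values.append(sum(w[i][j] for i, j in pairs) / len(pairs))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical path-length weights for a six-species master list (assumption).
w = [
    [0, 1, 2, 3, 3, 3],
    [1, 0, 2, 3, 3, 3],
    [2, 2, 0, 3, 3, 3],
    [3, 3, 3, 0, 1, 3],
    [3, 3, 3, 1, 0, 3],
    [3, 3, 3, 3, 3, 0],
]

for m, (low, high) in funnel(w, range(2, 7)).items():
    print(m, round(low, 3), round(high, 3))
```

The limits narrow as m increases and collapse onto the master-list Δ+ when m = s, which gives the funnel its shape; for a real regional list only `tree_constants` needs the full weight matrix, and it is evaluated once.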