1038/nature08656
LETTERS
A phylogeny-driven genomic encyclopaedia of
Bacteria and Archaea
Dongying Wu1,2, Philip Hugenholtz1, Konstantinos Mavromatis1, Rüdiger Pukall3, Eileen Dalin1, Natalia N. Ivanova1,
Victor Kunin1, Lynne Goodwin4, Martin Wu5, Brian J. Tindall3, Sean D. Hooper1, Amrita Pati1, Athanasios Lykidis1,
Stefan Spring3, Iain J. Anderson1, Patrik D’haeseleer1,6, Adam Zemla6, Mitchell Singer2, Alla Lapidus1, Matt Nolan1,
Alex Copeland1, Cliff Han4, Feng Chen1, Jan-Fang Cheng1, Susan Lucas1, Cheryl Kerfeld1, Elke Lang3,
Sabine Gronow3, Patrick Chain1,4, David Bruce4, Edward M. Rubin1, Nikos C. Kyrpides1, Hans-Peter Klenk3
& Jonathan A. Eisen1,2
Sequencing of bacterial and archaeal genomes has revolutionized our gene from across the tree of life7. Working from the root to the tips of
understanding of the many roles played by microorganisms1. There the tree, we identified the most divergent lineages that lacked repre-
are now nearly 1,000 completed bacterial and archaeal genomes sentatives with sequenced genomes (completed or in progress)8 and
available2, most of which were chosen for sequencing on the basis for which a species has been formally described9 and a type strain
of their physiology. As a result, the perspective provided by designated and deposited in a publicly accessible culture collection10.
the currently available genomes is limited by a highly biased phylo- From hundreds of candidates, 200 type strains were selected both to
genetic distribution3–5. To explore the value added by choosing obtain broad coverage across Bacteria and Archaea and to perform
microbial genomes for sequencing on the basis of their evolutionary in-depth sampling of a single phylum. The Gram-positive bacterial
relationships, we have sequenced and analysed the genomes of 56 phylum Actinobacteria was chosen for the latter purpose because of
culturable species of Bacteria and Archaea selected to maximize the availability of many phylogenetically and phenotypically diverse
phylogenetic coverage. Analysis of these genomes demonstrated cultured strains, and because it had the lowest percentage of sequenced
pronounced benefits (compared to an equivalent set of genomes isolates of any phylum (1% versus an average of 2.3%)11. Of the 200
randomly selected from the existing database) in diverse areas includ- targeted isolates, 159 were designated as ‘high’ priority primarily on the
ing the reconstruction of phylogenetic history, the discovery of basis of phylum-level novelty and the ability to obtain microgram quan-
new protein families and biological properties, and the prediction tities of high quality DNA. The genomes of these 159 are being
of functions for known genes from other organisms. Our results sequenced, assembled, annotated (including recommended metadata12)
strongly support the need for systematic ‘phylogenomic’ efforts to and finished, and relevant data are being released through a dedicated
compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria Integrated Microbial Genomes database portal13 and deposited into
and Archaea’ in order to derive maximum knowledge from exist- GenBank. Currently, data from 106 genomes (62 of which are finished)
ing microbial genome data as well as from genome sequences to are available.
come. To assess the ramifications of this tree-based selection of organisms,
Since the publication of the first complete bacterial genome, we focused our analyses on the first 56 genomes for which the shotgun
sequencing of the microbial world has accelerated beyond expecta- phase of sequencing was completed. The 53 bacteria and 3 archaea
tions. The inventory of bacterial and archaeal isolates with complete or (Supplementary Table 1) represent both a broad sampling of bacterial
draft sequences is approaching the two thousand mark2. Most of these diversity and a deeper sampling of the phylum Actinobacteria (26
genome sequences are the product of studies in which one or a few GEBA genomes). An initial question we addressed was whether selec-
isolates were targeted because of an interest in a specific characteristic tion on the basis of phylogenetic novelty of SSU rRNA genes reliably
of the organism. Although large-scale multi-isolate genome sequen- identifies genomes that are phylogenetically novel on the basis of other
cing studies have been performed, they have tended to be focused on criteria. This question arises because it is known that single genes, even
particular habitats or on the relatives of specific organisms. This over- SSU rRNA genes, do not perfectly predict genome-wide phylogenetic
all lack of broad phylogenetic considerations in the selection of micro- patterns14,15. To investigate this, we created a ‘genome tree’ (ref. 16) of
bial genomes for sequencing, combined with a cultivation bottleneck6, completed bacterial genomes (Fig. 1) and then measured the relative
has led to a strongly biased representation of recognized microbial contribution of the GEBA project using the phylogenetic diversity
phylogenetic diversity3–5. Although some projects have attempted to metric17. We found that the 53 GEBA bacteria accounted for 2.8–4.4
correct this (for example, see ref. 5), they have all been small in scope. times more phylogenetic diversity than randomly sampled subsets of
To evaluate the potential benefits of a more systematic effort, we 53 non-GEBA bacterial genomes. A similar degree of improvement in
embarked on a pilot project to sequence approximately 100 genomes phylogenetic diversity was seen for the more intensively sampled acti-
selected solely for their phylogenetic novelty: the ‘Genomic nobacteria (Table 1). These analyses indicate that although SSU rRNA
Encyclopedia of Bacteria and Archaea’ (GEBA). genes are not a perfect indicator of organismal evolution, their phylo-
Organisms were selected on the basis of their position in a phylo- genetic relationships are a sound predictor of phylogenetic novelty
genetic tree of small subunit (SSU) ribosomal RNA, the best sampled within the universal gene core present in bacterial genomes.
1
DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 2University of California, Davis, Davis, California 95616, USA. 3DSMZ, German Collection of Microorganisms and
Cell Cultures, 38124 Braunschweig, Germany. 4DOE Joint Genome Institute-Los Alamos National Laboratory, Los Alamos, California 87545, USA. 5University of Virginia,
Charlottesville, Virginia 22904, USA. 6Lawrence Livermore National Laboratory, Livermore, California 94550, USA.
1056
©2009 Macmillan Publishers Limited. All rights reserved
NATURE | Vol 462 | 24/31 December 2009 LETTERS
365
AA
Caldic hermoan m thermmacu reducenicum 4C
CB
5334
Clostridium botu
ATC
Can oxydo Desu cteriu oaceolfei s m IA WNvula
11
Strepococcus faecalis V5 r 6b str SLCC
ellulo
Syn
Carb
ZYH33 oris SK
Clostridi s metalliredigen
cus
Bacillus pumilus SAFR 032 426
tro
Alicyclobacillus acidocaldarius
Pelo Desulf hydro terium estica CC 3ttinge 3
atus thermu toba m mo tica Atr Go 148 LF
i 23K
ridiu ulfoto
acetobutylicum
ilu
um phyto
emolytic
71
tra
K16
kaustophilus
83
Mo wolfe m th oph Veillo gala ynov B CT 3 1
He rella thsubs rmop s JWnella tiae Pe 53
5
na
45
ldia m
s
liob
Cl
o
Anaero
ihenstep
lfi
W ATC 56
a
C
o
ferme
9
agna
C
riu rm
1 79
Lac ptococ us mu
welshim
cocc C 29328
Ur
rm
lum
ticus
My yc My a hy sma eum llise C 7 ns Hr PG L1 li
M
IF
lum
m novyi NT
A yn ob fle s n te o V1 67
u
M
ATC
we
p w hil
ri
co opla co op ge o pt 00 F 1
pla
dii OhIL 0
TC
My
yc co ulm rit ob e G 9
M
a
ATC xida
La tobac us c
DSMs MB4
sm
op pla o id il 7 37
Bacillus
d
l
Lysiniba
co My plas plas 3 s a pe es a fl sm AY 8
My a p art a m on lium M1 R
m
Maree
N e y o e a e n um s
as
QY
pio P10 1
tor Z 29 51
sm ma sm um a e um 70
a
las sm nis is e 1 44
ilu
b cu cit
s
m
pla co m m tr ne SC or a W A
555
si
As
A
M
pa
C os et ba cte es is A
a
th cy cc m rm cc M 3B 74 fl
e is s s ryt PC el C1 a 2 1
MF
1
ma a U 15 63 8
sp sp p hr C on 10 13
Sy richstoc os hlor ccu iola ntia olzii cus
te
sm pla a p a g AT tra s um ma B
yc yc rova lasmyco las top sm ii P
2
n
M
89
m
As
Ente
u NM par G2
rv
BP
T
Fu ry
yc
o o r
la
R rp o ba id v
um My ubs Me tus hyt lai
AT PC PC aeu 73 gat 17
03
T o rm oc co r v ura nh tia
M
a s A 8L K
so el
oi
ns
05
C C C 7 m 10 us
lo
c
d
ba
e
w
1
es
ya ec ho sm c ho ar JA P
c
ct
10
C 6 0 IM 2
s
c lu
c ia IP
er
s
w
ec st u e e us BI
51 80 02 S
to
iu itc A
Ca s b olep
m
n a C
c
he ch
6
nu 5
nd roo la
37
La
cle 1
id m sm
3 30 P1
n is s
14 3
ia
84 6
a p a
at St
CM
2
T h
um re 6
id m la a G
S C 98
Th s pt IE C
N sP rC P1
er De ubs ob
a u 7 1 1 s st M
m th p Le aci os at 0 92 inu CC
y
2
an io n
ae su uc Se pto llus in ng C3 02 IT ar str
t
a
Fe ro lfo lea ba tri m r ug elo RC C99 311 tr M p m L2A ris
rv T v v t l c o ae s sp C 9 s bs T sto
Th ido The herm ibrio ibrioum dell hia nili is cu s p C us su NA pa
erm ba rm o a p A a t bu fo st oc cu s s p C rin s tr p 41
S n
os cter oto tog cida eptTCCerm cc rmi cy c c u s a inu s s bs 99
ro ho co cc us m ar u su
iph ium ga a le m id 2 it ali s ic ec ho co cc us m rin s SM
o m t in ov 5 id s M yn ec ho co occ cus ma arinu us D
Pe me nod arit ting ovo ora 586is S yn ec ho oc c us m hil
Bu tro lan os im ae ra ns S yn ec lor oco cc us op
e u n S yn ch lor co cc lan sei
chn T M tog sie m a M TM s S ro ch ro co xy oe m ns
Bu e
chn ra ap De herm Meio eio a m nsis Rt1 SB O P ro chlo ro ter r w ulu uce
ino us th he ob B 7 B 8 t
Wig
Bu era a hidico co th erm rm ilis I4 1 P ro hlo ac cte arv ed
gles chn phid la cc er us us SJ 29 P roc rob iba p inir m s
wort e s u s m o r 9 P ub ex ium liotr ta urtu xidan 27
hia
glo
ra a ico tr A
p la P
rad ph silv ube 5
iod ilus anu r R on ob he len m c roo 08 2705
ssin Buc hidic str S S Ac C top ia lla riu fer TW CC 1 09
idia hne ola g yr ura H s A lack rthe cte m plei m N 41 M 71 129
end ra ap str B Schiz thosip ns B8 S gge oba robiu hip ngu m K DS C 13
Bau Cand osy hid p B aph ho R1 E rypt ic a w lo ikeiu cum NCT
man id m C cidimerym erium m je alyti riae 14
nia atus B Cand bion icola s aizon is gra n pis
cica lo t g u A ph act teriu ure hthe YS 3 0216
dell chma idatus of Glotr Cc C ia pis minumm Troifidobebac teriumm dip iens SRS3
inic
ola nnia peBloch ssina inara taciae B oryn bac teriu effic rans
str H n m b c C ryne bac rium tole
c H nsylva annia revipaedri
oma n fl lpis Co ryne bacte radio ernae
Acti S lo icu orid Co ryne ccus cav ena
nob higella disca s str B anus
Haemacillus suflexneri coagula PEN Co eoco bergia flavig silytica
ophilu ccin 2a s ta Kineuten onas ellulo
s du ogene tr 301 B llulom nas c ddieii
Vibri c re s1 Ce nimo cter ke ans
Aerom o ch yi 35000 30Z Xyla guiba nitrific rius is
onas P
salmonhotobacterVibrio fischlerae O3 HP
o San esia de sedenta ium ganens
michi
icida ium pr eri E 95 Jon ococcus rium faec nsis subsp
Shew subsp sa ofundu S114 Kyt chybacte ichigane str CTCB07
Shewan anella sedi lmonicidam SS9 Bra bacter m bsp xyli
ella frigid minis HA A449 Clavi onia xyli su ila DC2201 33209
Psychro imarina NCIMW EB3 ATCC
monas B 400 Leifs ria rhizoph lmoninarum
Co
Pseudoalte lwellia psychreingrahamii 37 Kocu acterium sa 24
romonas ryt Renibobacter sp FB nes KPA171202
haloplanktishraea 34H Arthr ac ter ium ac
Idiomarina TAC125 Propionibflavida
ioih iensis L2T
Pseudoalteromo
nas atlantica R Kribbella es sp JS614
Kangiella koreensis T6c Nocardioid acidiphila
Marinomonas Catenulispora coelicolor A3 2
Chromohalobacter salexigens sp MWYL1 Streptomyces 11B
DSM 3043 Acidothermus cellulolyticus
Hahella chejuensis KCTC 2396 Thermobispora bispora
Marinobacter aquaeolei VT8 Streptosporangium roseum
Cellvibrio japonicus Ueda107 Thermomonospora curvata
Thermobifida fusca YX
Saccharophagus degradans 2 40 Nocardiopsis dassonvillei
e pv phaseolicola 1448A4 Frankia sp CcI3
Pseudomonas syringa us 273
Psychrobacter arctic r sp PRwf 1
Salinispora areni
cola CNS 205
Psychrobacte r sp ADP1 Stackebrand
Acinetobacte 2 Geodermatoptia nassauensis
ax bo rku mensis SK 1 Nakamure hilus obscur
us
Alcanivor RSA 33ns Actinosyn lla multipartita
burnetii
Coxiella mophila str LeHA Saccha ne ma mirum
lla pneu cius okutanii L 2 ro
Saccha monospora vir
Legione yo so a XC Mycob ropolys idis
sicom ogen larctica po
atus Ve ira crun ho Myc acterium ra erythraea
Candid Thiomicrosp is subsp S1703A Myc obacteriu abscessu NRRL
2338
larens sus VC ula1 Rho obacteriu m gilvum s
sella tu cter nodo sa TemecK279a No dococcu m leprae PYR GCK
Franci loba idio ilia th
Diche ylella fast maltoph s str BSaL1 Gordcardia fa s sp RH TN
X o nas psulatu phila E 1 Tsukaonia b rcinica A1
ho m ca a halo MLH 07 ro
Lep mure nchia M 101 IF
otrop cus i 7 Le tospir lla pa lis 52
Sten thylococ odospir hrliche CC 19 H 1
Me Halorh ola e ni AT s SP 1 2 Bra ptospir a bifle urometa
nic ocea oran EF0 J2 Tre chysp a inte xa sero bola
lilim v e Tre ponemira mu rrogan var Pa
Alka coccus acidoisenia rans C118 p o a rd s s to
so tia e o T 6 Bo n d oc erov c str
Nitro Delfbacter alenivucens ii SP 1 Bo rreli ema entic hii ar C ain
ro aphth ired lodn m PM 1 R rreli a he pallid ola A ope Patoc
h
p
ine nas n x ferr x cho hilu STIR is Plahodo a garirmsii D um s TCC nha
gen 1 Ames
V e rm o ra ri ip s ns M ncto pirell nii P AH ubsp 35405 i str
la rom odofe ptoth etrole sariu ane 383s Ak ethyla myc ula b Bi pall
id
Fioc
ruz
P o Rh e
L m p ece taiw sp an s Op kerm cid es li altic um L1
ia d N
libiu r n vidus lder oxy 197 1 C it u a ip h m n a S s tr Nic
thy cte o ic N Chandidtus tnsia ilum ophil H 1 hols
Me leoba upria urkh rsen viump Eb CB 1 Ch lam at err muc infe us
uc C B sa aa s aR 9 C la y u s e a in rn
o lyn o na etell rcus atic ha C 196 Ch hlammyddia t Pro PB9 iphila orum
P iim ord zoa rom rop 25 259 T C lo y o ra to 0 AT V
rm
in B A a u t C 5 K C hlo ro do phil cho chla 1 CC 4
He as e TC C 2 s 5 C hlo ro her ph a a ma m
on nas is A TC llatu 194 72 1 P hlo ro biu pe ila bo tis ydia
BA
A8
rom mo rm s A ge P1 124 83 6 Ch ros rob bac m p ton pne rtus A H am 35
c hlo roso ltifo can s fla CC C 2 4 4 9 R
S h r c
lo the ium ulu ha tha umo S2 AR oeb
6
De Nit mu itrifi cillu ae N ATC JCM sp 903 y2 m e la 13 op
C al od ob o p p ob ss nia 3 hil
ira en a e s P 18 C an inib oth ium chlo hae arv act ium e TW aU
sp s d ylob rrho eum ran riumTCCcus isB yt d a e c ris o um er A WE
ne lav rum ob MA ten s K lou 25 8
so i op id ct rm h b oid T 1
7
e
Dy pir ob op ide roid as ia mchra oph
ga s A rub m ch iof ro IB BS 35
P h te ac m S ga syc 80 P 3A
om cu min es m la lifo tr yi R
io ia s
hu mo er arin rom orm ides 832 1 110
C c b ro s a p 0 m
er ium ra hy in uto lu l
au m o ia B 09 M
Th
ac a l he pin aio tas alis ri G
Ba ara phy atu oph m KT utu p YO
a iz i li i u
iss ter um et sp r a pa
izo ru b ta ra s
D
tc eb S us at is D 7
P or did cyt eriu etii min s
ter ct ulo m C s 38 1
te ing pa en tao on W W
hi o M DS SM
P an no ct ors m ium
i
de er p ba aris C 1 DS 41
r f u rin sis m is 83 SS
ns p 1 i Ca M 2
C ap oba a f biu nib F5 5 2 SM 1 5144
n
er ale u
c
J nitr om cte M 54 1
n q in b
o ca ba o
C lav ell ro oge us V 15 s D CC
m ba on hil 38
Rh cocc eoba anna ifica ero r sp CS1 44
rto lla r w izo
D3 2 66
m
di o m
F am ic dr lic SB ne
ii us 55
gob Eryth odob us d cter schians O yi DS K3 0
ro ylo 65
en
in th do
Ba one cte yrh
h h AT as
Zym Sp arom cter r sphrifica ae p CC 114 3
ia an eu
ta
C et
rt ba ad
S uif u su e i 2 1
C iat
ns
ck X ps
Aqitratir ella cter hpylor C37 4018 SM 1
ona opy ticivotorali eroids PD FL 1 1
M C ic
Ba itro Br
N olin ba ter NB i RM ns D
rin
ran s H es 122 2
S
o 3 3 us
W elico bac sp tzler fica m
Sp obilis alask s DSTCC 2 4 2
d
Glu hingo subs ensis M 1225941
e
H elico ovumr bu enitri yianu CC B lei 269
ij o 40 5a
H lfur acte as d dele is AT doy 82 40
Be Rh
Gra aceto conob mona p mo RB2 444
Su ob on
6 2
Arc lfurim pirillu r ho ni su sp fe
bac acter acter s witti bilis Z256
ter b diaz oxy chii M4
N
a n D
icr AT
Mag dospirill cidiphili ensis Cicus P 21H
GDN Al 5
on C
Anap lasma magneti TCC 1 JF 5
De
a mar hagocytocum AM170
str hilum H 1
es
VP C 8
Solib lovibrio s xanthu Fw109
s
Bdelococcu acter sp
B
Welgestr Jake
M
Ana
Haliang ium cellulo licus DSM 23
m tis
I5 0
1
yama
Wilmington
Pelobacter
G20
vond
o
Desulfomicrobium baculatum
48 3
Syntrophobacte
Desulfococcus oleovo
Desulfotalea psychrophila LSv54
DP4
Pa m
bac
5
ob ic
biu
2
bellii RML36
se Sil
li
m
Candidatus Pelantia tsutsugamushi Bor
izo
str
JIP
ium oc
sp m ru um c
subsp desulfuricanssp
s
ra os
ue HTC
Rh
02
1
si
s m xis
Pa inor
sd
a
T
min bsp
tatu
ginale
86
Ro
typhi str
lovleyi SZ
D
m str
hr
74 9
Wol
om hing
gibacter ubiq
p
0
Rickettsia
sum So
itrophicus SB
voru
nn
Eh
u
b
ettsia se
251
s
lasm
76
tu
phin
AA
rans Hxd3
neto
nuli
ia rum
s
345
con
ce 56
381 7
vos
Rho
t
symbion
Glu
Orie
Desulfovibrio desulfuricans
No
Ehrlich
9
80
9
hia endo
Figure 1 | Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding
genes16. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names.
The discovery and characterization of new gene families and their as ecological niche; nevertheless, higher rates of novel protein family
associated novel functions provide one incentive for sequencing discovery were found in the more phylogenetically diverse taxa
additional genomes, analysis of which has helped to redefine the (Fig. 2). In addition, of the 16,797 families identified in the 56
protein family universe18. We explored the quantitative effect of GEBA genomes, 1,768 showed no significant sequence similarity to
tree-based genome selection on the pace of discovery of novel proteins any proteins, indicating the presence of novel functional diversity.
and functions. Specifically, we compared the rate of discovery of These results highlight the utility of tree-based genome selection as
novel protein families when progressively adding more closely related a means to maximize the identification of novel protein families and
genomes versus when adding more distantly related ones (Fig. 2). argues against lateral gene transfer significantly redistributing genetic
Granted, many factors contribute to protein family diversity, such novelty between distantly related lineages.
1057
©2009 Macmillan Publishers Limited. All rights reserved
LETTERS NATURE | Vol 462 | 24/31 December 2009
Table 1 | Effect of SSU rRNA tree-based selection of organisms on compar- families, respectively. Halorhabdus utahensis, a halophilic archaeon
ative genomic metrics known to have b-xylanase and b-xylosidase activities21, has a chromo-
Comparative genomic metric GEBA set Random sets Fold somal cluster including two GH10 family b-xylanases and six novel
(number of resamplings) improvement GH5 family proteins of unknown specificity.
The enrichment of genetic diversity is also seen within families of
Genome tree phylogenetic diversity17 non-coding RNAs, transposable elements, and other cellular compo-
Bacteria (domain) 11.0 3.2 6 0.7 (100) 2.8–4.4
Actinobacteria (phylum) 4.3 1.4 6 0.3 (100) 2.5–3.9 nents. For example, the genome of the marine myxobacterium
Haliangium ochraceum contains 807 CRISPR (clustered regularly
New protein family links 46 3 6 4 (5) 6.6 to .15.3 interspaced short palindromic repeats) units including the largest
Genes in new chromosomal cassettes 71,579 16,579 6 5,523 (20) 3.2–6.5 single CRISPR array known, comprising 382 spacer/repeat units.
New gene fusions 433 65 6 31 (20) 4.5–12.7
CRISPR is a newly recognized, but ancient and widespread, system
in bacteria and archaea that confers resistance to viruses and other
GEBA genomes were compared to equivalently sized random sets of reference genomes to
quantify the effect of phylogenetic selection. invading foreign DNAs22.
Results from the GEBA pilot project challenge our current under-
Novel proteins also can serve to link distantly related homologues standing for the taxonomic distribution of known gene families. The
whose relatedness would otherwise go undetected. Forty-six such most striking example of which is the discovery of an actin homo-
links were identified in the 56 GEBA genomes compared to an average logue in H. ochraceum. Actin and its close relatives are structural
of only three new links in equivalent sets of randomly sampled non- components of the eukaryotic cytoskeleton that are found in every
GEBA genomes (Table 1). A useful complement to homology-based eukaryote and only in eukaryotes. Bacteria and archaea encode
predictions of gene function are ‘non-homology methods’ (ref. 19) instead the shape-determining protein MreB. Although MreBs have
such as gene context-based inference that relies on the conserved some functional and structural similarities to eukaryotic actins, they
clustering of functionally related genes across multiple genomes, often are regarded, at best, distantly related homologues23 and possibly not
in operons or as gene fusions20. We identified over 70,000 genes in new
chromosomal cassettes of two or more genes in the GEBA genomes.
RNase treated
This represents a three- to sixfold increase over equivalent sets of non- a b
GEBA genomes (Table 1). Similarly, the number of new gene fusions
Markers
Peptidase M4
identified in the GEBA genomes is 4 to ,13 times greater than in thermolysin
DNA
RNA
randomly selected genome sets (Table 1). Because the GEBA data
set produced a several-fold improvement over random sets for all
metrics examined (Table 1), we predict that other aspects of Hypothetical protein
sequence-based biological discovery will similarly benefit from tree-
based genome sequencing. Hypothetical protein 1.6 kb
The GEBA genomes also show significant phylogenetic expansions 1.0 kb
within known protein families. For example, although only two of the Hypothetical protein
0.5 kb
56 GEBA organisms are known cellulose degraders, we identified in the
set of genomes a variety of glycoside hydrolase (GH) genes that may BARP
participate in the breakdown of cellulose and hemicelluloses. Among
Phosphatase c
these are 28 and 7 phylogenetically divergent members of the endo- 1 kb
glucanase- and processive exoglucanase-containing GH6 and GH48 Oxidoreductase
1C0F_A . . . . G D E A Q S K R G I L T L K Y P I E HG I V T N W D D M E K IW H H T F Y N E LR V A P E E
BARP V L C V G Q E A I D Q S A T V L L R H P V W SG I V G D W E A F A A VL R H T F Y R A LW V A P E E
40,000
Actinobacteria 1C0F_A H P V L L T E A P L . . . N P K A N R E K M TQ I M F E T F N T P A MY V A I Q A V L SL Y A S G R
BARP H P I V V T E S P H V Y R S F Q L R R E Q L TR L L F E T F H A P Q VA V C S E A A M SL Y A C G L
(649.6 new families/genome)
30,000 1C0F_A T T G I V M D S G D G V S H T V P I Y E G Y AL P H A I L R L D L A GR D L T D Y M M KI L T E R G
BARP D T G L V V S L G D F V S Y V A P V H R G A IV D A G L T F L E P D GR S I T E Y L S RL L L E R G
Enterobacteriaceae 1C0F_A Y S F T T T A E R E I V R D I K E K L A Y V AL D F E A E M Q T A A SS S A L E K S Y EL P D G Q V
20,000 (307.7 new families/genome) BARP H V F T S P E A L R L V R D I K E T L C Y V AD D V A K E . . A A R NA D S V E A T Y LL P N G E T
1C0F_A I T I G N E R F R C P E A L F Q P S F L G M ES A G I H E T T Y N S IM K C D V D I R KD L Y G N V
10,000 BARP L V L G N E R F R C P E V L F H P D L L G W ES P G L T D A V C N A IM K C D P S L Q AE L F G N I
Streptococcus agalactiae 1C0F_A V L S G G T T M F P G I A D R M N K E L T A LA P S T M K I K I I A PP E R K Y S V W IG G S I L A
(121.3 new families/genome) BARP V V T G G G S L F P G L S E R L Q R E L E Q RA P A E A P V H L L T RD D R R H L P W KG A A R F A
0
0 10 20 30 40 50 60 70 80 1C0F_A S L S T F Q Q M W I S K E E Y D E S G P S I VH R K C F
BARP R D A Q F A G F A L T R Q A Y E R H G A E L IY Q M . .
Genome number
Figure 2 | Rate of discovery of protein families as a function of Figure 3 | A bacterial homologue of actin. a, Genomic context of the bacterial
phylogenetic breadth of genomes. For each of four groupings (species, actin-related protein (BARP) gene within the genome of the marine
different strains of Streptococcus agalactiae; family, Enterobacteriaceae; Deltaproteobacterium H. ochraceum. Red, gene encoding BARP; white, genes
phylum, Actinobacteria; domain, GEBA bacteria), all proteins from that encoding hypothetical proteins; black, genes with functional annotations.
group were compared to each other to identify protein families. Then the b, RT–PCR demonstration of expression of the gene encoding BARP in
total number of protein families was calculated as genomes were H. ochraceum. c, Ribbon plot of the putative structure of BARP. d, Alignment
progressively sampled from the group (starting with one genome until all of BARP with actin from Dictyostelium discoideum29 with similarities in black
were sampled). This was done multiple times for each of the four groups shaded text. Secondary structure elements (arrows, beta-strands; bars, alpha-
using random starting seeds; the average and standard deviation were then helices) are colour-coded as in c. A phylogenetic tree including this protein is in
plotted. Supplementary Figure 1.
1058
©2009 Macmillan Publishers Limited. All rights reserved
NATURE | Vol 462 | 24/31 December 2009 LETTERS
even homologous. Like other bacteria, H. ochraceum encodes a bona tree of life. The pilot study presented here is a dedicated first step in
fide MreB protein, but in addition, it encodes a protein that is clearly this direction.
a member of the actin family, which we have named BARP (bacterial
actin-related protein; Fig. 3). Although we do not yet have evidence METHODS SUMMARY
for its precise function, BARP is expressed in H. ochraceum (Fig. 3b). Starting with a phylogenetic tree of SSU rRNA genes7, we identified major
Assuming that the H. ochraceum mreB orthologue performs the same branches that had no available genome sequences but for which cultured isolates
function as in other bacteria, and given that the myxobacteria, to were available in the DSMZ or ATCC culture collections. Selected isolates
(Supplementary Table 1a, b) from these branches were grown and DNA isolated
which this species belongs, are known to synthesize actin-targeting
(Supplementary Table 1c) and quality checked. DNA was then used for shotgun
toxins24, we propose that this BARP may be a dominant-negative genome sequencing by Sanger/ABI, Roche/454 and/or Illumina/Solexa technolo-
inhibitor of eukaryotic actin polymerization. Regardless of its precise gies (Supplementary Table 2). Sequence reads were assembled separately with
function, this first—and so far only—discovery of an expressed different assembly methods and the best draft assembly was used for annotation
homologue of eukaryotic actin in a member of the Bacteria highlights and as a starting point for genome completion (current genome status is in
the potential for novel and surprising biological discoveries given a Supplementary Table 2). Annotation (gene identification, functional prediction,
wider genomic sampling of the tree of life. etc.) was performed using the IMG system (http://img.jgi.doe.gov/geba); this was
We conclude that targeting microorganisms for genome sequencing done both after shotgun sequencing and again after genome completion. For ‘whole
genome tree’ analysis, a PHYML maximum likelihood phylogenetic tree of a con-
solely on the basis of phylogenetic considerations offers significant far-
catenated alignment of 31 marker genes was built using AMPHORA16. Phylogenetic
reaching benefits in diverse areas. Furthermore, the benefits of phylo- diversity was calculated as the sum of branch lengths in this and other trees. Protein
genetically driven genome sequencing show no sign of saturating with families were built for various genome sets by using the Markov clustering algo-
these first 56 genomes. A key question then lies in determining how rithm (MCL)28 to group proteins on the basis of ‘all versus all’ blastp searches. For
much bacterial and archaeal diversity remains to be sampled. Using analysis of phylogenetic diversity of organisms, a phylogenetic tree was built for a
SSU rRNA gene sequences as a proxy for organismal diversity (Fig. 4), combined alignment of SSU rRNA sequences from published genomes and a non-
we estimate that sequencing the genomes of only 1,520 phylogenetically redundant subset of greengenes SSU rRNA7. Further analysis of the genomes was
selected isolates could encompass half of the phylogenetic diversity done using IMG database queries and new computational analyses as described in
the main text, legends and Supplementary Methods.
represented by known cultured bacteria and archaea. Given the continu-
ing reductions in both the cost and difficulty in sequencing genomes25, Received 3 June; accepted 30 October 2009.
this is certainly a tractable target in the next few years.
1. Fraser, C. M., Eisen, J. A. & Salzberg, S. L. Microbial genome sequencing. Nature
However, the great majority of recognized bacterial and archaeal 406, 799–803 (2000).
diversity is not represented by pure cultures and an additional 9,218 2. Liolios, K., Mavromatis, K., Tavernarakis, N. & Kyrpides, N. C. The Genomes On
genome sequences from currently uncultured species would be Line Database (GOLD) in 2007: status of genomic and metagenomic projects and
required to capture 50% of this recognized diversity (Fig. 4). Such their associated metadata. Nucleic Acids Res. 36 (database issue), D475–D479
(2008).
an undertaking will require new approaches to culturing or proces- 3. Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3,
sing of multi-species samples using methods such as metagenomics26 REVIEWS0003.1–REVIEWS0003.8 (2002).
or physical isolation of cells from mixed populations followed by 4. Eisen, J. A. Assessing evolutionary relationships among microbes from whole-
whole genome amplification methods27. Obtaining reference gen- genome analysis. Curr. Opin. Microbiol. 3, 475–480 (2000).
5. Wu, D. et al. Complete genome sequence of the aerobic CO-oxidizing
omes for the uncultured microbial majority will be a natural exten- thermophile Thermomicrobium roseum. PLoS One 4, e4207 (2009).
sion of the GEBA project, the ultimate goal of which is to provide a 6. Pace, N. R. A molecular view of microbial diversity and the biosphere. Science 276,
phylogenetically balanced genomic representation of the microbial 734–740 (1997).
7. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and
workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
1,200 8. Bernal, A., Ear, U. & Kyrpides, N. Genomes OnLine Database (GOLD): a monitor of
GEBA genomes
Pre-GEBA genomes genome projects world-wide. Nucleic Acids Res. 29, 126–127 (2001).
Organisms from the greengenes database 9. Lapage, S. P. et al., International Code of Nomenclature of Bacteria, 1990 Revision.
1,000 Organisms from the greengenes database (American Society for Microbiology, 1992).
(excluding environmental samples) 10. Ward, N., Eisen, J., Fraser, C. & Stackebrandt, E. Sequenced strains must be saved
from extinction. Nature 414, 148 (2001).
Phylogenetic diversity
800 11. Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11,
120 551–553 (2009).
12. Field, D. et al. The minimum information about a genome sequence (MIGS)
100
specification. Nature Biotechnol. 26, 541–547 (2008).
600
80 13. Markowitz, V. M. et al. The integrated microbial genomes (IMG) system in 2007:
data content and analysis tool extensions. Nucleic Acids Res. 36 (database issue),
60 D528–D533 (2008).
400 14. Achtman, M. & Wagner, M. Microbial diversity and the genetic nature of
40
microbial species. Nature Rev. Microbiol. 6, 431–440 (2008).
20 15. Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution
200 on genome phylogeny. Syst. Biol. 57, 844–856 (2008).
0 16. Wu, M. & Eisen, J. A. A simple, fast, and accurate method of phylogenomic
0 400 800 1,200
inference. Genome Biol. 9, R151 (2008).
17. Pardi, F. & Goldman, N. Resource-aware taxon selection for maximizing
0
0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 phylogenetic diversity. Syst. Biol. 56, 431–444 (2007).
Number of organisms 18. Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. & Ouzounis, C. A. Myriads of
protein families, and still counting. Genome Biol. 4, 401 (2003).
Figure 4 | Phylogenetic diversity of bacteria and archaea on the basis 19. Marcotte, E. M. et al. Detecting protein function and protein-protein interactions
of SSU rRNA genes. Using a phylogenetic tree of unique SSU rRNA gene from genome sequences. Science 285, 751–753 (1999).
sequences7, phylogenetic diversity was measured for four subsets of this tree: 20. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction
organisms with sequenced genomes pre-GEBA (blue), the GEBA organisms maps for complete genomes based on gene fusion events. Nature 402, 86–90
(1999).
(red), all cultured organisms (dark grey), and all available SSU rRNA genes
21. Wainø, M. & Ingvorsen, K. Production of b-xylanase and b-xylosidase by the
(light grey). For each subtree, taxa were sorted by their contribution to the extremely halophilic archaeon Halorhabdus utahensis. Extremophiles 7, 87–93 (2003).
subtree phylogenetic diversity30 and the cumulative phylogenetic diversity 22. Barrangou, R. et al. CRISPR provides acquired resistance against viruses in
was plotted from maximal (left) to the least (right). The inset magnifies the prokaryotes. Science 315, 1709–1712 (2007).
first 1,500 organisms. Comparison of the plots shows the phylogenetic ‘dark 23. Doolittle, R. F. & York, A. L. Bacterial actins? An evolutionary perspective.
matter’ left to be sampled. Bioessays 24, 293–296 (2002).
1059
©2009 Macmillan Publishers Limited. All rights reserved
LETTERS NATURE | Vol 462 | 24/31 December 2009
24. Sasse, F., Kunze, B., Gronewold, T. M. & Reichenbach, H. The chondramides: Author Contributions D.W. (rRNA analysis, gene families, actin tree, manuscript
cytostatic agents from myxobacteria acting on the actin cytoskeleton. J. Natl. preparation), P.H. (selection of strains, analysis, manuscript preparation, project
Cancer Inst. 90, 1559–1563 (1998). coordination), L.G. and D.B. (project management), R.P., B.J.T., E.L., S.G., S.S. (strain
25. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnol. 26, curation and growth), K.M., N.N.I., I.J.A., S.D.H., A.P., A.Ly. (annotation, genome
1135–1145 (2008). analysis), V.K. (CRISPRs, actin), M.W. (whole genome tree), P.D., C.K., A.Z. and
26. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A M.S. (actin studies), M.N., S.L., J.-F.C., F.C. and E.D. (sequencing), C.H., A.La., M.N.
bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578 and A.C. (finishing), P.C. (analysis), E.M.R. (manuscript preparation), N.C.K.
(2008). (selection of strains, annotation, analysis), H.-P.K. (strain selection and growth,
27. Ishoey, T., Woyke, T., Stepanauskas, R., Novotny, M. & Lasken, R. S. Genomic DNA preparation, manuscript preparation), J.A.E. (project lead and coordination,
sequencing of single microbial cells from environmental samples. Curr. Opin. analysis, manuscript preparation).
Microbiol. 11, 198–204 (2008).
28. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for Author Information Genome sequence and annotation data is available at the JGI
large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002). IMG-GEBA page http://img.jgi.doe.gov/geba and has been submitted to GenBank
29. Matsuura, Y. et al. Structural basis for the higher Ca21-activation of the regulated with accessions ABSZ00000000, ABTA00000000, ABTB00000000,
actin-activated myosin ATPase observed with Dictyostelium/Tetrahymena actin ABTC00000000, ABTD00000000, CP001618, ABTF00000000,
chimeras. J. Mol. Biol. 296, 579–595 (2000). ABTG00000000, ABTH00000000, ABTI00000000, ABTJ00000000,
30. Moulton, V., Semple, C. & Steel, M. Optimizing phylogenetic diversity under ABTK00000000, ABTM00000000, NZ_ABTN00000000,
constraints. J. Theor. Biol. 246, 186–194 (2007). NZ_ABTO00000000, ABTP00000000, ABTQ00000000,
NZ_ABTR00000000, NZ_ABTS00000000, NZ_ABTT00000000,
Supplementary Information is linked to the online version of the paper at NZ_ABTU00000000, NZ_ABTV00000000, NZ_ABTW00000000,
www.nature.com/nature. NZ_ABTX00000000, NZ_ABTY00000000, NZ_ABTZ00000000,
Acknowledgements . We thank the following people for assistance in aspects of the NZ_ABUA00000000, NZ_ABUB00000000, NZ_ABUC00000000,
project including planning and discussions (R. Stevens, G. Olsen, R. Edwards, NZ_ABUD00000000., NZ_ABUE00000000, NZ_ABUF00000000,
J. Bristow, N. Ward, S. Baker, T. Lowe, J. Tiedje, G. Garrity, A. Darling, S. Giovannoni), NZ_ABUG00000000, NZ_ABUH00000000, NZ_ABUI00000000,
analysis of genomes whose work could not be included in this report (B. Henrissat, NZ_ABUJ00000000, NZ_ABUK00000000, ABUL00000000,
G. Xie, J. Kinney, I. Paulsen, N. Rawlings, M. Huntemann), project management NZ_ABUM00000000, NZ_ABUO00000000, NZ_ABUP00000000,
(M. Miller, M. Fenner, M. McGowen, A. Greiner), sequencing and finishing (K. Ikeda, ABUQ00000000, NZ_ABUR00000000, ABUS00000000,
M. Chovatia, P. Richardson, T. Glavinadelrio, C. Detter), culture growth, DNA NZ_ABUT00000000, NZ_ABUU00000000, NZ_ABUV00000000,
extraction, and metadata (D. Gleim, E. Brambilla, S. Schneider, M. Schröder, NZ_ABUW00000000, NZ_ABUX00000000, NZ_ABUZ00000000,
M. Jando, G. Gehrich-Schröter, C. Wahrenburg, K. Steenblock, S. Welnitz, M. Kopitz, NZ_ABVA00000000, NZ_ABVB00000000 and NZ_ABVC00000000. All
R. Fähnrich, H. Pomrenke, A. Schütze, M. Rohde, M. Göker), and manuscript editing strains that have been sequenced are available from the DSMZ culture collection
(M. Youle). This work was performed under the auspices of the US Department of and culture accessions are available in Supplementary Information. This paper is
Energy’s Office of Science, Biological and Environmental Research Program, and by distributed under the terms of the Creative Commons
the University of California, Lawrence Berkeley National Laboratory under contract Attribution-Non-Commercial-Share Alike licence, and is freely available to all
no. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract readers at www.nature.com/nature. Reprints and permissions information is
no. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract no. available at www.nature.com/reprints. Further details on sequencing and genome
DE-AC02-06NA25396. Support for J.A.E., D.W. and M.W. was provided by the properties of each organism are being published in the journal Standards in
Gordon and Betty Moore Foundation Grant no. 1660 to J.A.E. Support for work at Genomic Sciences (SIGS) (http://standardsingenomics.org/). Correspondence
DSMZ was provided under DFG INST 599/1-1. and requests for materials should be addressed to J.A.E. (jaeisen@ucdavis.edu).
1060
©2009 Macmillan Publishers Limited. All rights reserved