
Vol 462 | 24/31 December 2009 | doi:10.1038/nature08656

LETTERS
A phylogeny-driven genomic encyclopaedia of
Bacteria and Archaea
Dongying Wu1,2, Philip Hugenholtz1, Konstantinos Mavromatis1, Rüdiger Pukall3, Eileen Dalin1, Natalia N. Ivanova1,
Victor Kunin1, Lynne Goodwin4, Martin Wu5, Brian J. Tindall3, Sean D. Hooper1, Amrita Pati1, Athanasios Lykidis1,
Stefan Spring3, Iain J. Anderson1, Patrik D’haeseleer1,6, Adam Zemla6, Mitchell Singer2, Alla Lapidus1, Matt Nolan1,
Alex Copeland1, Cliff Han4, Feng Chen1, Jan-Fang Cheng1, Susan Lucas1, Cheryl Kerfeld1, Elke Lang3,
Sabine Gronow3, Patrick Chain1,4, David Bruce4, Edward M. Rubin1, Nikos C. Kyrpides1, Hans-Peter Klenk3
& Jonathan A. Eisen1,2

Sequencing of bacterial and archaeal genomes has revolutionized our understanding of the many roles played by microorganisms1. There are now nearly 1,000 completed bacterial and archaeal genomes available2, most of which were chosen for sequencing on the basis of their physiology. As a result, the perspective provided by the currently available genomes is limited by a highly biased phylogenetic distribution3–5. To explore the value added by choosing microbial genomes for sequencing on the basis of their evolutionary relationships, we have sequenced and analysed the genomes of 56 culturable species of Bacteria and Archaea selected to maximize phylogenetic coverage. Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.

Since the publication of the first complete bacterial genome, sequencing of the microbial world has accelerated beyond expectations. The inventory of bacterial and archaeal isolates with complete or draft sequences is approaching the two thousand mark2. Most of these genome sequences are the product of studies in which one or a few isolates were targeted because of an interest in a specific characteristic of the organism. Although large-scale multi-isolate genome sequencing studies have been performed, they have tended to be focused on particular habitats or on the relatives of specific organisms. This overall lack of broad phylogenetic considerations in the selection of microbial genomes for sequencing, combined with a cultivation bottleneck6, has led to a strongly biased representation of recognized microbial phylogenetic diversity3–5. Although some projects have attempted to correct this (for example, see ref. 5), they have all been small in scope. To evaluate the potential benefits of a more systematic effort, we embarked on a pilot project to sequence approximately 100 genomes selected solely for their phylogenetic novelty: the ‘Genomic Encyclopedia of Bacteria and Archaea’ (GEBA).

Organisms were selected on the basis of their position in a phylogenetic tree of small subunit (SSU) ribosomal RNA, the best sampled gene from across the tree of life7. Working from the root to the tips of the tree, we identified the most divergent lineages that lacked representatives with sequenced genomes (completed or in progress)8 and for which a species has been formally described9 and a type strain designated and deposited in a publicly accessible culture collection10. From hundreds of candidates, 200 type strains were selected both to obtain broad coverage across Bacteria and Archaea and to perform in-depth sampling of a single phylum. The Gram-positive bacterial phylum Actinobacteria was chosen for the latter purpose because of the availability of many phylogenetically and phenotypically diverse cultured strains, and because it had the lowest percentage of sequenced isolates of any phylum (1% versus an average of 2.3%)11. Of the 200 targeted isolates, 159 were designated as ‘high’ priority primarily on the basis of phylum-level novelty and the ability to obtain microgram quantities of high quality DNA. The genomes of these 159 are being sequenced, assembled, annotated (including recommended metadata12) and finished, and relevant data are being released through a dedicated Integrated Microbial Genomes database portal13 and deposited into GenBank. Currently, data from 106 genomes (62 of which are finished) are available.

To assess the ramifications of this tree-based selection of organisms, we focused our analyses on the first 56 genomes for which the shotgun phase of sequencing was completed. The 53 bacteria and 3 archaea (Supplementary Table 1) represent both a broad sampling of bacterial diversity and a deeper sampling of the phylum Actinobacteria (26 GEBA genomes). An initial question we addressed was whether selection on the basis of phylogenetic novelty of SSU rRNA genes reliably identifies genomes that are phylogenetically novel on the basis of other criteria. This question arises because it is known that single genes, even SSU rRNA genes, do not perfectly predict genome-wide phylogenetic patterns14,15. To investigate this, we created a ‘genome tree’ (ref. 16) of completed bacterial genomes (Fig. 1) and then measured the relative contribution of the GEBA project using the phylogenetic diversity metric17. We found that the 53 GEBA bacteria accounted for 2.8–4.4 times more phylogenetic diversity than randomly sampled subsets of 53 non-GEBA bacterial genomes. A similar degree of improvement in phylogenetic diversity was seen for the more intensively sampled actinobacteria (Table 1). These analyses indicate that although SSU rRNA genes are not a perfect indicator of organismal evolution, their phylogenetic relationships are a sound predictor of phylogenetic novelty within the universal gene core present in bacterial genomes.
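The comparison described above can be illustrated with a minimal sketch. It assumes that phylogenetic diversity is the summed length of all branches connecting a set of tips to the root (the rooted variant is an assumption made here), and that the fold-improvement range is the value for the selected set divided by the random-set mean plus or minus one standard deviation. The second assumption is an inference that reproduces the ranges reported in Table 1; it is not a description of the authors' code. The toy tree, its tip names and the helper names are hypothetical.

```python
import random
import statistics

# Toy rooted tree stored as child -> (parent, branch length to parent).
# Tip names and branch lengths are invented for illustration only.
TREE = {
    "A": ("n1", 0.1), "B": ("n1", 0.1),
    "C": ("n2", 0.2), "D": ("n2", 0.3),
    "n1": ("n3", 0.4), "n2": ("n3", 0.5),
    "E": ("root", 1.0), "n3": ("root", 0.2),
}
TIPS = ["A", "B", "C", "D", "E"]


def phylogenetic_diversity(tips, tree=TREE):
    """Sum the lengths of all branches on the paths from the given tips to the root."""
    seen = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk towards the root, counting each branch only once.
        while node in tree and node not in seen:
            seen.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def random_set_pd(candidates, k, resamplings, seed=0):
    """Mean and sample standard deviation of PD over random k-tip subsets."""
    rng = random.Random(seed)
    values = [phylogenetic_diversity(rng.sample(candidates, k))
              for _ in range(resamplings)]
    return statistics.mean(values), statistics.stdev(values)


def fold_range(selected_value, random_mean, random_sd):
    """Fold improvement of the selected set over mean + sd and mean - sd."""
    low = selected_value / (random_mean + random_sd)
    if random_mean > random_sd:
        high = selected_value / (random_mean - random_sd)
    else:
        high = float("inf")  # lower bound of the random sets is not positive
    return low, high


if __name__ == "__main__":
    # 'Bacteria (domain)' row of Table 1: 11.0 versus 3.2 +/- 0.7
    low, high = fold_range(11.0, 3.2, 0.7)
    print(f"{low:.1f}-{high:.1f}")  # prints 2.8-4.4
```

Under these assumptions the same function gives 2.5 and 3.9 for the Actinobacteria row (4.3 versus 1.4 ± 0.3), which agrees with the tabulated range.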
1DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 2University of California, Davis, Davis, California 95616, USA. 3DSMZ, German Collection of Microorganisms and Cell Cultures, 38124 Braunschweig, Germany. 4DOE Joint Genome Institute-Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA. 5University of Virginia, Charlottesville, Virginia 22904, USA. 6Lawrence Livermore National Laboratory, Livermore, California 94550, USA.

©2009 Macmillan Publishers Limited. All rights reserved
NATURE | Vol 462 | 24/31 December 2009 LETTERS

[Figure 1 image: circular phylogenetic tree of the bacterial domain with species names as tip labels. Branch colour key (phyla): Gammaproteobacteria, Betaproteobacteria, Alphaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Acidobacteria, Aquificae, Bacteroidetes, Chlorobi, Chlamydiae/Verrucomicrobia, Planctomycetes, Spirochaetes, Actinobacteria, Cyanobacteria, Chloroflexi, Firmicutes, Tenericutes, Fusobacteria, Synergistetes, Thermotogae, Deinococcus/Thermus.]

Figure 1 | Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding
genes16. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names.
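As a rough illustration of what a concatenated alignment is, the sketch below joins per-gene alignments taxon by taxon and pads a taxon that lacks a gene with gap characters. It is a simplified stand-in and not the AMPHORA procedure used for Figure 1, which is not reproduced here; the function name, the padding rule and the toy sequences are assumptions.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Join per-gene alignments into one sequence per taxon.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    All sequences within one gene alignment must have the same length.
    A taxon missing from a gene alignment is padded with gap characters.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Two invented marker-gene alignments over three hypothetical taxa.
GENES = [
    {"X": "MK-L", "Y": "MKAL"},
    {"X": "GG", "Z": "GA"},
]
CONCATENATED = concatenate_alignments(GENES)
```

Every taxon ends up with a sequence of the same total length, which is what tree-building programs expect as input.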

The discovery and characterization of new gene families and their associated novel functions provide one incentive for sequencing additional genomes, analysis of which has helped to redefine the protein family universe18. We explored the quantitative effect of tree-based genome selection on the pace of discovery of novel proteins and functions. Specifically, we compared the rate of discovery of novel protein families when progressively adding more closely related genomes versus when adding more distantly related ones (Fig. 2). Granted, many factors contribute to protein family diversity, such as ecological niche; nevertheless, higher rates of novel protein family discovery were found in the more phylogenetically diverse taxa (Fig. 2). In addition, of the 16,797 families identified in the 56 GEBA genomes, 1,768 showed no significant sequence similarity to any proteins, indicating the presence of novel functional diversity. These results highlight the utility of tree-based genome selection as a means to maximize the identification of novel protein families and argue against lateral gene transfer significantly redistributing genetic novelty between distantly related lineages.
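The rate-of-discovery comparison can be sketched as an accumulation curve: genomes are added in random order, the number of distinct protein families seen so far is recorded, and the mean and standard deviation are taken over many random orders. The sketch assumes that family assignments already exist for every genome; the clustering step itself is not shown. The input dictionaries and the function name are hypothetical.

```python
import random
import statistics


def accumulation_curve(genome_families, replicates=100, seed=0):
    """Mean and standard deviation of cumulative family counts.

    genome_families: dict mapping genome name -> set of protein family ids.
    Returns one (mean, sd) pair per number of genomes sampled (1, 2, ...).
    """
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    counts = [[] for _ in genomes]
    for _ in range(replicates):
        order = genomes[:]
        rng.shuffle(order)          # a new random sampling order each time
        seen = set()
        for i, genome in enumerate(order):
            seen |= genome_families[genome]
            counts[i].append(len(seen))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in counts]


# Invented toy data: close relatives share most families,
# distant relatives share only one.
CLOSE = {"s1": {1, 2, 3, 4}, "s2": {1, 2, 3, 5}, "s3": {1, 2, 4, 5}}
DISTANT = {"g1": {1, 2, 3, 4}, "g2": {1, 5, 6, 7}, "g3": {1, 8, 9, 10}}

CLOSE_CURVE = accumulation_curve(CLOSE)
DISTANT_CURVE = accumulation_curve(DISTANT)
```

In this toy case the curve for the closely related set flattens at 5 families, whereas the curve for the distantly related set keeps rising to 10, which is the qualitative pattern the comparison is designed to detect.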

Table 1 | Effect of SSU rRNA tree-based selection of organisms on comparative genomic metrics

Comparative genomic metric              GEBA set   Random sets (number of resamplings)   Fold improvement
Genome tree phylogenetic diversity17
  Bacteria (domain)                     11.0       3.2 ± 0.7 (100)                       2.8–4.4
  Actinobacteria (phylum)               4.3        1.4 ± 0.3 (100)                       2.5–3.9
New protein family links                46         3 ± 4 (5)                             6.6 to >15.3
Genes in new chromosomal cassettes      71,579     16,579 ± 5,523 (20)                   3.2–6.5
New gene fusions                        433        65 ± 31 (20)                          4.5–12.7

GEBA genomes were compared to equivalently sized random sets of reference genomes to quantify the effect of phylogenetic selection.

[Figure 2 plot: x axis, Genome number (0 to 80); y axis, Protein family number (including families with single members) (0 to 70,000). Curves: Bacteria from GEBA project (1,060.6 new families/genome); Actinobacteria (649.6 new families/genome); Enterobacteriaceae (307.7 new families/genome); Streptococcus agalactiae (121.3 new families/genome).]

Figure 2 | Rate of discovery of protein families as a function of phylogenetic breadth of genomes. For each of four groupings (species, different strains of Streptococcus agalactiae; family, Enterobacteriaceae; phylum, Actinobacteria; domain, GEBA bacteria), all proteins from that group were compared to each other to identify protein families. Then the total number of protein families was calculated as genomes were progressively sampled from the group (starting with one genome until all were sampled). This was done multiple times for each of the four groups using random starting seeds; the average and standard deviation were then plotted.

Novel proteins can also serve to link distantly related homologues whose relatedness would otherwise go undetected. Forty-six such links were identified in the 56 GEBA genomes compared to an average of only three new links in equivalent sets of randomly sampled non-GEBA genomes (Table 1). A useful complement to homology-based predictions of gene function is the use of ‘non-homology methods’ (ref. 19), such as gene context-based inference, which relies on the conserved clustering of functionally related genes across multiple genomes, often in operons or as gene fusions20. We identified over 70,000 genes in new chromosomal cassettes of two or more genes in the GEBA genomes. This represents a three- to sixfold increase over equivalent sets of non-GEBA genomes (Table 1). Similarly, the number of new gene fusions identified in the GEBA genomes is 4 to ~13 times greater than in randomly selected genome sets (Table 1). Because the GEBA data set produced a several-fold improvement over random sets for all metrics examined (Table 1), we predict that other aspects of sequence-based biological discovery will similarly benefit from tree-based genome sequencing.

The GEBA genomes also show significant phylogenetic expansions within known protein families. For example, although only two of the 56 GEBA organisms are known cellulose degraders, we identified in the set of genomes a variety of glycoside hydrolase (GH) genes that may participate in the breakdown of cellulose and hemicelluloses. Among these are 28 and 7 phylogenetically divergent members of the endoglucanase- and processive exoglucanase-containing GH6 and GH48 families, respectively. Halorhabdus utahensis, a halophilic archaeon known to have β-xylanase and β-xylosidase activities21, has a chromosomal cluster including two GH10 family β-xylanases and six novel GH5 family proteins of unknown specificity.

The enrichment of genetic diversity is also seen within families of non-coding RNAs, transposable elements, and other cellular components. For example, the genome of the marine myxobacterium Haliangium ochraceum contains 807 CRISPR (clustered regularly interspaced short palindromic repeats) units including the largest single CRISPR array known, comprising 382 spacer/repeat units. CRISPR is a newly recognized, but ancient and widespread, system in bacteria and archaea that confers resistance to viruses and other invading foreign DNAs22.

[Figure 3 image: a, gene map with labels Peptidase M4 thermolysin, Hypothetical protein, BARP, Phosphatase, Oxidoreductase, Conserved hypothetical protein and Methyltransferase, with a 1 kb scale bar; b, gel with lanes Markers, DNA, RNA and RNase treated, and size markers at 1.6 kb, 1.0 kb and 0.5 kb; c, ribbon plot; d, sequence alignment with rows labelled 1C0F_A and BARP.]

Figure 3 | A bacterial homologue of actin. a, Genomic context of the bacterial actin-related protein (BARP) gene within the genome of the marine Deltaproteobacterium H. ochraceum. Red, gene encoding BARP; white, genes encoding hypothetical proteins; black, genes with functional annotations. b, RT–PCR demonstration of expression of the gene encoding BARP in H. ochraceum. c, Ribbon plot of the putative structure of BARP. d, Alignment of BARP with actin from Dictyostelium discoideum29 with similarities in black shaded text. Secondary structure elements (arrows, beta-strands; bars, alpha-helices) are colour-coded as in c. A phylogenetic tree including this protein is in Supplementary Figure 1.

Results from the GEBA pilot project challenge our current understanding of the taxonomic distribution of known gene families. The most striking example is the discovery of an actin homologue in H. ochraceum. Actin and its close relatives are structural components of the eukaryotic cytoskeleton that are found in every eukaryote and only in eukaryotes. Bacteria and archaea encode instead the shape-determining protein MreB. Although MreBs have some functional and structural similarities to eukaryotic actins, they are regarded as, at best, distantly related homologues23 and possibly not
even homologous. Like other bacteria, H. ochraceum encodes a bona tree of life. The pilot study presented here is a dedicated first step in
fide MreB protein, but in addition, it encodes a protein that is clearly this direction.
a member of the actin family, which we have named BARP (bacterial
actin-related protein; Fig. 3). Although we do not yet have evidence METHODS SUMMARY
for its precise function, BARP is expressed in H. ochraceum (Fig. 3b). Starting with a phylogenetic tree of SSU rRNA genes7, we identified major
Assuming that the H. ochraceum mreB orthologue performs the same branches that had no available genome sequences but for which cultured isolates
function as in other bacteria, and given that the myxobacteria, to were available in the DSMZ or ATCC culture collections. Selected isolates
(Supplementary Table 1a, b) from these branches were grown and DNA isolated
which this species belongs, are known to synthesize actin-targeting
(Supplementary Table 1c) and quality checked. DNA was then used for shotgun genome sequencing by Sanger/ABI, Roche/454 and/or Illumina/Solexa technologies (Supplementary Table 2). Sequence reads were assembled separately with different assembly methods and the best draft assembly was used for annotation and as a starting point for genome completion (current genome status is in Supplementary Table 2). Annotation (gene identification, functional prediction, etc.) was performed using the IMG system (http://img.jgi.doe.gov/geba); this was done both after shotgun sequencing and again after genome completion. For ‘whole genome tree’ analysis, a PHYML maximum likelihood phylogenetic tree of a concatenated alignment of 31 marker genes was built using AMPHORA16. Phylogenetic diversity was calculated as the sum of branch lengths in this and other trees. Protein families were built for various genome sets by using the Markov clustering algorithm (MCL)28 to group proteins on the basis of ‘all versus all’ blastp searches. For analysis of phylogenetic diversity of organisms, a phylogenetic tree was built for a combined alignment of SSU rRNA sequences from published genomes and a non-redundant subset of greengenes SSU rRNA7. Further analysis of the genomes was done using IMG database queries and new computational analyses as described in the main text, legends and Supplementary Methods.

Received 3 June; accepted 30 October 2009.

toxins24, we propose that this BARP may be a dominant-negative inhibitor of eukaryotic actin polymerization. Regardless of its precise function, this first, and so far only, discovery of an expressed homologue of eukaryotic actin in a member of the Bacteria highlights the potential for novel and surprising biological discoveries given a wider genomic sampling of the tree of life.

We conclude that targeting microorganisms for genome sequencing solely on the basis of phylogenetic considerations offers significant far-reaching benefits in diverse areas. Furthermore, the benefits of phylogenetically driven genome sequencing show no sign of saturating with these first 56 genomes. A key question then lies in determining how much bacterial and archaeal diversity remains to be sampled. Using SSU rRNA gene sequences as a proxy for organismal diversity (Fig. 4), we estimate that sequencing the genomes of only 1,520 phylogenetically selected isolates could encompass half of the phylogenetic diversity represented by known cultured bacteria and archaea. Given the continuing reductions in both the cost and difficulty in sequencing genomes25, this is certainly a tractable target in the next few years.
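The estimate above rests on two ideas stated in the paper: phylogenetic diversity is the sum of branch lengths in a tree, and taxa can be ranked by how much diversity each contributes (Fig. 4). The following Python sketch illustrates both on a toy tree. It is not the authors' pipeline: the tree, its branch lengths, the function names, and the use of a simple greedy ranking are all assumptions made for illustration, and the real analysis used an SSU rRNA tree with tens of thousands of tips.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy rooted tree: node -> (parent, length of the branch leading to the node).
# Topology and branch lengths are invented for illustration only.
Tree = Dict[str, Tuple[Optional[str], float]]

TOY_TREE: Tree = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "n2": ("root", 0.5),
    "A": ("n1", 2.0),
    "B": ("n1", 0.5),
    "C": ("n2", 3.0),
    "D": ("n2", 0.25),
}


def path_to_root(tree: Tree, tip: str) -> List[str]:
    """Return the nodes whose incoming branches connect `tip` to the root."""
    path: List[str] = []
    node: Optional[str] = tip
    while node is not None and tree[node][0] is not None:
        path.append(node)
        node = tree[node][0]
    return path


def phylogenetic_diversity(tree: Tree, tips: List[str]) -> float:
    """Sum of branch lengths on the rooted subtree spanning `tips`."""
    covered: Set[str] = set()
    for tip in tips:
        covered.update(path_to_root(tree, tip))
    return sum(tree[node][1] for node in covered)


def greedy_cumulative_pd(tree: Tree, tips: List[str]) -> List[Tuple[str, float]]:
    """Rank tips by marginal gain in diversity (assumed greedy scheme).

    Returns (tip, cumulative diversity) pairs, largest contribution first.
    """
    covered: Set[str] = set()
    remaining = list(tips)
    total = 0.0
    curve: List[Tuple[str, float]] = []

    def gain(tip: str) -> float:
        # Only branches not already covered by earlier picks add diversity.
        return sum(tree[n][1] for n in path_to_root(tree, tip) if n not in covered)

    while remaining:
        best = max(remaining, key=gain)
        total += gain(best)
        covered.update(path_to_root(tree, best))
        remaining.remove(best)
        curve.append((best, total))
    return curve


def taxa_needed(curve: List[Tuple[str, float]], fraction: float) -> int:
    """Smallest number of ranked taxa whose cumulative diversity reaches `fraction` of the total."""
    target = fraction * curve[-1][1]
    for count, (_, cumulative) in enumerate(curve, start=1):
        if cumulative >= target:
            return count
    return len(curve)


if __name__ == "__main__":
    curve = greedy_cumulative_pd(TOY_TREE, ["A", "B", "C", "D"])
    print(curve)
    print(taxa_needed(curve, 0.5))
```

On a real tree the same ranking would give a curve of the kind plotted in Fig. 4, and `taxa_needed(curve, 0.5)` corresponds to the question of how many selected genomes would capture half of the diversity. A tree library would replace the hand-written parent map in practice.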
However, the great majority of recognized bacterial and archaeal diversity is not represented by pure cultures and an additional 9,218 genome sequences from currently uncultured species would be required to capture 50% of this recognized diversity (Fig. 4). Such an undertaking will require new approaches to culturing or processing of multi-species samples using methods such as metagenomics26 or physical isolation of cells from mixed populations followed by whole genome amplification methods27. Obtaining reference genomes for the uncultured microbial majority will be a natural extension of the GEBA project, the ultimate goal of which is to provide a phylogenetically balanced genomic representation of the microbial

[Figure 4: phylogenetic diversity (y axis) plotted against number of organisms (x axis) for four series: GEBA genomes; pre-GEBA genomes; organisms from the greengenes database; organisms from the greengenes database (excluding environmental samples).]

Figure 4 | Phylogenetic diversity of bacteria and archaea on the basis of SSU rRNA genes. Using a phylogenetic tree of unique SSU rRNA gene sequences7, phylogenetic diversity was measured for four subsets of this tree: organisms with sequenced genomes pre-GEBA (blue), the GEBA organisms (red), all cultured organisms (dark grey), and all available SSU rRNA genes (light grey). For each subtree, taxa were sorted by their contribution to the subtree phylogenetic diversity30 and the cumulative phylogenetic diversity was plotted from maximal (left) to the least (right). The inset magnifies the first 1,500 organisms. Comparison of the plots shows the phylogenetic ‘dark matter’ left to be sampled.

1. Fraser, C. M., Eisen, J. A. & Salzberg, S. L. Microbial genome sequencing. Nature 406, 799–803 (2000).
2. Liolios, K., Mavromatis, K., Tavernarakis, N. & Kyrpides, N. C. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 36 (database issue), D475–D479 (2008).
3. Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, REVIEWS0003.1–REVIEWS0003.8 (2002).
4. Eisen, J. A. Assessing evolutionary relationships among microbes from whole-genome analysis. Curr. Opin. Microbiol. 3, 475–480 (2000).
5. Wu, D. et al. Complete genome sequence of the aerobic CO-oxidizing thermophile Thermomicrobium roseum. PLoS One 4, e4207 (2009).
6. Pace, N. R. A molecular view of microbial diversity and the biosphere. Science 276, 734–740 (1997).
7. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
8. Bernal, A., Ear, U. & Kyrpides, N. Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 29, 126–127 (2001).
9. Lapage, S. P. et al. International Code of Nomenclature of Bacteria, 1990 Revision (American Society for Microbiology, 1992).
10. Ward, N., Eisen, J., Fraser, C. & Stackebrandt, E. Sequenced strains must be saved from extinction. Nature 414, 148 (2001).
11. Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11, 551–553 (2009).
12. Field, D. et al. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnol. 26, 541–547 (2008).
13. Markowitz, V. M. et al. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res. 36 (database issue), D528–D533 (2008).
14. Achtman, M. & Wagner, M. Microbial diversity and the genetic nature of microbial species. Nature Rev. Microbiol. 6, 431–440 (2008).
15. Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution on genome phylogeny. Syst. Biol. 57, 844–856 (2008).
16. Wu, M. & Eisen, J. A. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 9, R151 (2008).
17. Pardi, F. & Goldman, N. Resource-aware taxon selection for maximizing phylogenetic diversity. Syst. Biol. 56, 431–444 (2007).
18. Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. & Ouzounis, C. A. Myriads of protein families, and still counting. Genome Biol. 4, 401 (2003).
19. Marcotte, E. M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–753 (1999).
20. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999).
21. Wainø, M. & Ingvorsen, K. Production of β-xylanase and β-xylosidase by the extremely halophilic archaeon Halorhabdus utahensis. Extremophiles 7, 87–93 (2003).
22. Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712 (2007).
23. Doolittle, R. F. & York, A. L. Bacterial actins? An evolutionary perspective. Bioessays 24, 293–296 (2002).

24. Sasse, F., Kunze, B., Gronewold, T. M. & Reichenbach, H. The chondramides: cytostatic agents from myxobacteria acting on the actin cytoskeleton. J. Natl. Cancer Inst. 90, 1559–1563 (1998).
25. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnol. 26, 1135–1145 (2008).
26. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578 (2008).
27. Ishoey, T., Woyke, T., Stepanauskas, R., Novotny, M. & Lasken, R. S. Genomic sequencing of single microbial cells from environmental samples. Curr. Opin. Microbiol. 11, 198–204 (2008).
28. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002).
29. Matsuura, Y. et al. Structural basis for the higher Ca2+-activation of the regulated actin-activated myosin ATPase observed with Dictyostelium/Tetrahymena actin chimeras. J. Mol. Biol. 296, 579–595 (2000).
30. Moulton, V., Semple, C. & Steel, M. Optimizing phylogenetic diversity under constraints. J. Theor. Biol. 246, 186–194 (2007).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank the following people for assistance in aspects of the project including planning and discussions (R. Stevens, G. Olsen, R. Edwards, J. Bristow, N. Ward, S. Baker, T. Lowe, J. Tiedje, G. Garrity, A. Darling, S. Giovannoni), analysis of genomes whose work could not be included in this report (B. Henrissat, G. Xie, J. Kinney, I. Paulsen, N. Rawlings, M. Huntemann), project management (M. Miller, M. Fenner, M. McGowen, A. Greiner), sequencing and finishing (K. Ikeda, M. Chovatia, P. Richardson, T. Glavinadelrio, C. Detter), culture growth, DNA extraction, and metadata (D. Gleim, E. Brambilla, S. Schneider, M. Schröder, M. Jando, G. Gehrich-Schröter, C. Wahrenburg, K. Steenblock, S. Welnitz, M. Kopitz, R. Fähnrich, H. Pomrenke, A. Schütze, M. Rohde, M. Göker), and manuscript editing (M. Youle). This work was performed under the auspices of the US Department of Energy’s Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under contract no. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract no. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract no. DE-AC02-06NA25396. Support for J.A.E., D.W. and M.W. was provided by the Gordon and Betty Moore Foundation Grant no. 1660 to J.A.E. Support for work at DSMZ was provided under DFG INST 599/1-1.

Author Contributions D.W. (rRNA analysis, gene families, actin tree, manuscript preparation), P.H. (selection of strains, analysis, manuscript preparation, project coordination), L.G. and D.B. (project management), R.P., B.J.T., E.L., S.G., S.S. (strain curation and growth), K.M., N.N.I., I.J.A., S.D.H., A.P., A.Ly. (annotation, genome analysis), V.K. (CRISPRs, actin), M.W. (whole genome tree), P.D., C.K., A.Z. and M.S. (actin studies), M.N., S.L., J.-F.C., F.C. and E.D. (sequencing), C.H., A.La., M.N. and A.C. (finishing), P.C. (analysis), E.M.R. (manuscript preparation), N.C.K. (selection of strains, annotation, analysis), H.-P.K. (strain selection and growth, DNA preparation, manuscript preparation), J.A.E. (project lead and coordination, analysis, manuscript preparation).

Author Information Genome sequence and annotation data is available at the JGI IMG-GEBA page http://img.jgi.doe.gov/geba and has been submitted to GenBank with accessions ABSZ00000000, ABTA00000000, ABTB00000000, ABTC00000000, ABTD00000000, CP001618, ABTF00000000, ABTG00000000, ABTH00000000, ABTI00000000, ABTJ00000000, ABTK00000000, ABTM00000000, NZ_ABTN00000000, NZ_ABTO00000000, ABTP00000000, ABTQ00000000, NZ_ABTR00000000, NZ_ABTS00000000, NZ_ABTT00000000, NZ_ABTU00000000, NZ_ABTV00000000, NZ_ABTW00000000, NZ_ABTX00000000, NZ_ABTY00000000, NZ_ABTZ00000000, NZ_ABUA00000000, NZ_ABUB00000000, NZ_ABUC00000000, NZ_ABUD00000000, NZ_ABUE00000000, NZ_ABUF00000000, NZ_ABUG00000000, NZ_ABUH00000000, NZ_ABUI00000000, NZ_ABUJ00000000, NZ_ABUK00000000, ABUL00000000, NZ_ABUM00000000, NZ_ABUO00000000, NZ_ABUP00000000, ABUQ00000000, NZ_ABUR00000000, ABUS00000000, NZ_ABUT00000000, NZ_ABUU00000000, NZ_ABUV00000000, NZ_ABUW00000000, NZ_ABUX00000000, NZ_ABUZ00000000, NZ_ABVA00000000, NZ_ABVB00000000 and NZ_ABVC00000000. All strains that have been sequenced are available from the DSMZ culture collection and culture accessions are available in Supplementary Information. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. Reprints and permissions information is available at www.nature.com/reprints. Further details on sequencing and genome properties of each organism are being published in the journal Standards in Genomic Sciences (SIGS) (http://standardsingenomics.org/). Correspondence and requests for materials should be addressed to J.A.E. (jaeisen@ucdavis.edu).
