TIGR, Bioinformatics Department, 9712 Medical Center Drive, Rockville, MD 20850, USA
*To whom correspondence should be addressed. Tel: +1 301 795 7566; Fax: +1 301 838 0208; Email: selengut@tigr.org
same function. In contrast to ortholog families, this may include examples of paralogs (duplicated genes) and laterally transferred genes, so long as they have the same function.
Curators of TIGRFAMs models apply a number of criteria to determine whether a model qualifies as an equivalog. These fall into three areas: (i) the observation that two or more sequences within the family have experimentally characterized or highly trusted annotations for a particular function, (ii) that no sequences within the family are indicated by reasonable evidence to have a different function and (iii) that phylogenetic trees constructed from members of the family are consistent with the hypothesis that the most recent common ancestor of the characterized (or highly trusted) members of the family is an ancestor of all of the members of the family.
The TIGRFAMs database contains 3000 curated protein family models, of which 1700 are of equivalog type. These models enable accurate, thorough, automated assignment of molecular function. They are supplemented by over 100 equivalog domain models, which assign annotations to discrete domains of multi-functional proteins, and by over 350 hypothetical equivalog models, which describe
experimentally characterized homologs. A trusted match to an equivalog model justifies the automated transfer of these data to the sequence in question, generally reducing the need for manual review to a bare minimum. The annotation that results has a high degree of consistency from one genome to another, and in principle can improve consistency for annotation projects between different annotation centers. When, on occasion, errors or changes in scientific formalisms require modifications to the terms associated with a TIGRFAMs model, the changes can be propagated uniformly over all affected genes in a database (such as the CMR).
TIGRFAMs models, including equivalog models, have been used extensively in genome annotation at The Institute for Genomic Research (TIGR) for over 7 years. Manual review of annotations suggested by TIGRFAMs models has led to considerable feedback and to the improvement of the thresholds and annotations of many models. Currently, the AutoAnnotate program in TIGR's prokaryotic annotation pipeline uses HMM evidence extensively. AutoAnnotate weighs the evidence from HMMs, pair-wise homology searches and other analyses, makes tentative assertions as to molecular function and presents these data to human curators
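The statement that a trusted match to an equivalog model justifies automated transfer of annotation can be illustrated with a small sketch. This is not the AutoAnnotate implementation, which the text does not describe in detail. The class names, the `EXAMPLE_` accessions, the scores and the idea of a single per-model `trusted_cutoff` value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class EquivalogModel:
    accession: str         # hypothetical identifier, not a real TIGRFAMs accession
    product_name: str      # annotation attached to the model by its curators
    trusted_cutoff: float  # assumed score at or above which a match is trusted


@dataclass(frozen=True)
class Hit:
    protein_id: str
    model: EquivalogModel
    score: float


def transfer_annotations(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each protein to the product name of its best trusted equivalog hit."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.score < hit.model.trusted_cutoff:
            continue  # below the trusted cutoff: leave the protein for manual review
        current = best.get(hit.protein_id)
        if current is None or hit.score > current.score:
            best[hit.protein_id] = hit
    return {pid: h.model.product_name for pid, h in best.items()}


# Illustrative models and scores only; none of these values come from the database.
kinase = EquivalogModel("EXAMPLE_0001", "rhamnulokinase", trusted_cutoff=300.0)
isomerase = EquivalogModel("EXAMPLE_0002", "L-rhamnose isomerase", trusted_cutoff=250.0)

hits = [
    Hit("protein_A", kinase, 412.5),     # trusted match
    Hit("protein_B", kinase, 120.0),     # below the cutoff, so no transfer
    Hit("protein_C", isomerase, 250.0),  # exactly at the cutoff, treated as trusted
]
annotations = transfer_annotations(hits)
```

Because the annotation lives on the model rather than on each gene, correcting `product_name` in one place and re-running the transfer would update every affected protein, which mirrors the uniform propagation described above.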
The assertions loaded into the Genome Properties database are produced by a flexible and powerful rules engine. In principle, these rules can use nearly any data type, including manually populated annotations of specific EC numbers. In practice, the rules that perform metabolic reconstructions for Genome Properties rely primarily on hits to TIGRFAMs HMMs and some Pfam HMMs, over 900 of which are currently incorporated into the system. These rules can be applied in the absence of any annotation, automated or manual, as long as HMM search results are available. The finding that all necessary proteins for some system are present becomes a 'YES' assertion, meaning that the system itself is present, at least in principle. It need not be demonstrated in vivo. The controlled vocabulary term 'some evidence' means that not every required component of a system has been established, but enough have been detected to suggest the system is present. The terms 'not supported' and 'none found' are self-explanatory, while the term 'NO' represents an even stronger negative assertion that usually reflects additional manual review. The rules engine does not require that every HMM be an equivalog model. It may instead require the presence in a genome of at least one member of
Synergy between TIGRFAMs and Genome Properties
Not all protein families lend themselves to the assignment of specific molecular functions based solely on homology to multiple experimentally characterized proteins. Common cases include families with only a single characterized member, families with many, but heterogeneous, characterized members and those that have no characterized members but are subsets of larger families with established generic function. Additional information is required in these cases to define the proper boundaries of homogeneous function and create reliable equivalog models.
An evaluation of the genomic context of candidate family members can aid in this process. The Genome Properties system is an excellent tool for examining a protein within its genomic context. For example, a genome may have most components of a particular metabolic pathway identified unambiguously by equivalog HMMs. One step in the pathway may have several candidate genes identified by a less specific family or domain model, but there may be only one of these candidates within an apparent operon with other members of the pathway. One would infer that the embedded gene completes the pathway. In Mycobacterium smegmatis MC2, for instance, 58 potential sugar transporters of a type represented by the family-type model PF00083 are present while only one of them is proximal to genes associated with rhamnose catabolism (Table 1). In other cases, even though the model is not an equivalog, only one protein is identified in the genome. In the best case, the single candidate gene is near other genes associated with the biological process as well (rhamnose epimerase in Table 1, for instance), and that arrangement is repeated across multiple genomes.
The identification of genes with such auxiliary evidence can be fed back into the model building process, supporting the accurate definition of conserved-function family boundaries. A significant number of TIGRFAMs equivalogs have been constructed and/or validated in this manner over the past 2 years. Once such models have been constructed, a new round of Genome Properties evaluation may promote the assertion of state for that property from 'some evidence' to 'YES' and may further clarify the proper function of ambiguously assigned genes. Iteration of this cycle of improvements to Genome Property assertions and improvements to the underlying TIGRFAMs models has proved a remarkably
Phylogenetic profiling using TIGRFAMs and Genome Properties
Phylogenetic profiling is the process of inferring links between protein families and biological processes based on patterns of co-occurrence with other protein families involved in the same biological processes (6). In practice, the phylogenetic distribution for some biological process may differ from the pattern of the individual protein families that contribute essential parts of that process. This can happen, of course, for several reasons. Members of one protein family may substitute occasionally for those of another. An enzyme that performs a specific function may participate in different processes in different species. The set of parameters used to discriminate all true members of a protein family from all other proteins may be imprecise, especially prior to manual review. Missed gene calls, poor start-site predictions and sequencing errors may result in profiles with missing members.
Genome Properties, by creating composite objects that represent biological processes as opposed to molecular functions, can be used to generate phylogenetic profiles of much higher fidelity than those based on individual protein
Nucleic Acids Research, 2007, Vol. 35, Database issue D263
Table 1. An example of a gene cluster in Mycobacterium smegmatis MC2 identified as being responsible for rhamnose catabolism by Genome Properties analysis (http://cmr.tigr.org/tigr-scripts/CMR/shared/
Table 1 cell text, in extraction order: GO:0043463: regulation of / Ribose operon repressor / rhamnose catabolism / Apparent operon / MSMEG_0582 / Domain / Yes / 25 / Rhamnulo-kinase (RhaB) / GO:0019301: rhamnose / N-terminal domain / (RhaB/RhuK) / MSMEG_0581 / catabolism / Sugar kinase / 10 / alcohol dehydrogenase / Rhamnulose-1-phosphate / 1-phosphate aldolase/ / rhamnose catabolism / Oxidoreductase, short- / aldolase/alcohol / Hypothetical equivalog / EQUIVALOG / GO:0019301: / L-rhamnose isomerase / (RhaA/RhaI) / MSMEG_0579 / GO:0015762: rhamnose / transporter / 58 / Rhamnose epimerase / catabolism / (DUF718)
family models. Considering the rhamnose catabolism property illustrated in Table 1, eight genomes containing apparently extraneous hits to certain of the components of the system can be removed from the profile, while seven
such as the aldolase and the isomerase are captured by two independent, non-homologous equivalogs. Profiles associated with each of these four models would represent only a
pathway.
Property (5).
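The rule evaluation and the property-level phylogenetic profile discussed in the text can be sketched together. This is a minimal illustration under stated assumptions, not the Genome Properties rules engine: the function names, the component labels, the genome names and the `min_fraction` threshold that separates 'some evidence' from 'not supported' are all hypothetical. The 'NO' state is omitted because the text says it usually reflects manual review.

```python
from typing import Dict, Set


def evaluate_property(required: Set[str], hits: Set[str],
                      min_fraction: float = 0.5) -> str:
    """Assign a controlled-vocabulary state from the components detected in a genome."""
    if not required:
        raise ValueError("a property needs at least one required component")
    found = required & hits
    if found == required:
        return "YES"            # every required component was detected
    if not found:
        return "none found"
    if len(found) / len(required) >= min_fraction:
        return "some evidence"  # assumed threshold, not taken from the source
    return "not supported"


def property_profile(required: Set[str],
                     genome_hits: Dict[str, Set[str]]) -> Dict[str, str]:
    """Profile of a whole biological process across genomes."""
    return {genome: evaluate_property(required, hits)
            for genome, hits in genome_hits.items()}


def model_profile(model: str, genome_hits: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Profile of a single protein family model across genomes."""
    return {genome: model in hits for genome, hits in genome_hits.items()}


# Hypothetical component labels and genomes, for illustration only.
RHAMNOSE = {"kinase", "isomerase", "aldolase", "transporter"}
GENOME_HITS = {
    "genome_1": {"kinase", "isomerase", "aldolase", "transporter"},
    "genome_2": {"kinase", "isomerase", "transporter"},
    "genome_3": {"transporter"},  # an isolated hit to a broad family model
    "genome_4": set(),
}

by_property = property_profile(RHAMNOSE, GENOME_HITS)
by_model = model_profile("transporter", GENOME_HITS)
```

In this toy data the single-model profile marks `genome_3` as present, while the property-level profile reports it as 'not supported'. That is the kind of apparently extraneous hit that a composite property allows one to remove from a profile.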
Table 1 labels: Family / Genome Properties / annotated name / Number of hits in
ACKNOWLEDGEMENTS
REFERENCES
1. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
2. Haft,D.H., Selengut,J.D. and White,O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res., 31, 371–373.
3. Peterson,J.D., Umayam,L.A., Dickinson,T., Hickey,E.K. and White,O. (2001) The comprehensive microbial resource. Nucleic Acids Res., 29, 123–125.
4. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25, 25–29.
5. Haft,D.H., Selengut,J.D., Brinkac,L.M., Zafar,N. and White,O. (2005) Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics, 21, 293–306.
6. Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D. and Yeates,T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.
7. Haft,D.H., Paulsen,I.T., Ward,N. and Selengut,J.D. (2006) Exopolysaccharide-associated protein sorting in environmental organisms: the PEP-CTERM/EpsH system. Application of a novel phylogenetic profiling heuristic. BMC Biol., 4, 29.