
SURVEY OF ONTOLOGIES IN BIOINFORMATICS

by
SHAKTHEEVEL.S

INTRODUCTION TO ONTOLOGIES
Recent technological advances have produced a flood of biological information that is
accessible online. In the post-genomic era, a major bottleneck is the coherent integration of all
these public, online resources. Online bioinformatics databases are especially difficult to
integrate because they are complex, highly heterogeneous, dispersed, and constantly evolving.
Online data are often described only in human-readable formats that are difficult for computers
to analyze because they lack standardized structures.

A large number of biomedical ontologies and databases are currently available,
and more continue to be developed. There is even a site that tracks the publicly available
sources. Ontologies have emerged because of the need for a common language to support
effective human and computer communication across scattered, personal sources of data and
knowledge. This survey covers ontologies and databases used in the bioinformatics community.
It begins with ontologies concerned with medical and biological terminology and with
ontologies for organizing other ontologies. It then turns to the main XML-based
ontologies for bioinformatics, and finally to some of the many databases that have been developed for
biomedical purposes. Each database has its own structure and can therefore be regarded as
defining an ontology.

BIO-ONTOLOGIES
Ontologies are a versatile mechanism for understanding concepts and relationships. In
this section the concern is with the human communication of biomedical concepts as well as
with the understanding of that knowledge. Two ontologies are covered. The first, the Unified
Medical Language System, was originally focused on medical
terminology but now also includes many other biomedical vocabularies; it has grown to be
impressively large, but is sometimes incoherent as a result. The second, the Gene Ontology, focuses
exclusively on terminology for genomics.

UNIFIED MEDICAL LANGUAGE SYSTEM


Terminology is the most common denominator of all biomedical literature resources,
including the names of organisms, tissues, cell types, genes, proteins, and diseases. There are
various controlled vocabularies, such as the Medical Subject Headings (MeSH), associated with
these resources. MeSH was developed by the U.S. National Library of Medicine (NLM). In 1986,
NLM began a long-term research and development project to build the Unified Medical
Language System (UMLS). The UMLS is a repository of biomedical vocabularies and is the
NLM’s biological ontology (Lindberg et al. 1993; Baclawski et al. 2000; Yandell and Majoros
2002). The UMLS has three main components: the Metathesaurus (META), the
SPECIALIST lexicon and associated lexical programs, and the Semantic Network (SN) (Denny
et al. 2003). The UMLS is a rich source of knowledge in the biomedical domain and is
used for research and development in a range of applications, including natural
language processing (Baclawski et al. 2000; McCray et al. 2001).
Ref: www.nlm.nih.gov/research/umls
THE GENE ONTOLOGY
The most prominent ontology for bioinformatics is the Gene Ontology (GO). GO is produced by the GO
Consortium, which seeks to provide a structured, controlled vocabulary for describing
gene product function, process, and location (GO 2003, 2004). A description of a gene
product using GO terminology is called an annotation. One important use of GO is the
prediction of gene function based on patterns of annotation.
GO terms, and hence annotations, are classified into three categories:
a) Molecular function
b) Biological process
c) Cellular component
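The three categories above can be made concrete with a small data model. The following is a minimal Python sketch, not a GO Consortium format: the class name Annotation, the function group_by_aspect, and the gene product "geneX" are hypothetical, and the GO identifiers and term names are given for illustration only and should be checked against the current GO release.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three GO categories named in the text.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class Annotation:
    """One GO annotation: a gene product described by one GO term."""
    gene_product: str  # hypothetical identifier, e.g. "geneX"
    go_id: str         # GO term identifier
    term: str          # human-readable term name
    aspect: str        # one of ASPECTS

    def __post_init__(self):
        # Reject anything that is not one of the three categories.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {self.aspect}")


def group_by_aspect(annotations):
    """Return {aspect: [term names]} for a list of annotations."""
    grouped = defaultdict(list)
    for a in annotations:
        grouped[a.aspect].append(a.term)
    return dict(grouped)


# Illustrative annotations for a hypothetical gene product.
annotations = [
    Annotation("geneX", "GO:0003677", "DNA binding", "molecular_function"),
    Annotation("geneX", "GO:0006355", "regulation of DNA-templated transcription",
               "biological_process"),
    Annotation("geneX", "GO:0005634", "nucleus", "cellular_component"),
]
profile = group_by_aspect(annotations)
for aspect in ASPECTS:
    print(aspect, "->", profile.get(aspect, []))
```

Grouping annotations by category in this way gives the kind of per-gene annotation pattern on which function prediction can be based.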
Many programs have been developed for profiling gene expression based on GO.
Some of them are as follows:
DAG-Edit: DAG-Edit is an open source tool written in Java for browsing, searching, and
modifying structured controlled vocabularies.
Ref: sourceforge.net/project/showfiles.php?group_id=36855
GenMAPP: This tool visualizes gene expression and other genomic data on maps
representing biological pathways and groupings of genes.
Ref: www.GenMAPP.org
GoMiner: This program package organizes lists of “interesting” genes for biological
interpretation. GoMiner provides quantitative and statistical output files.
Ref: discover.nci.nih.gov/gominer
NetAffx GO Mining Tool: This tool permits web-based, interactive traversal of the GO
graph in the context of microarray data (Chang et al. 2004).
Ref: www.affymetrix.com/analysis/index.affx
FatiGO: This tool extracts GO terms that are significantly over- or underrepresented in sets
of genes within the context of a genome-scale experiment (Al-Shahrour et al. 2004).
Ref: fatigo.bioinfo.cnio.es
GOAL: The GO Automated Lexicon is a web-based application for the automated
identification of functions and processes (Volinia et al. 2004).
Ref: microarrays.unife.it
Onto-Tools: This is a collection of tools for a variety of tasks, all of which involve the use
of GO terminology (Draghici et al. 2003).
Ref: vortex.cs.wayne.edu/projects.html
DAVID: The Database for Annotation, Visualization and Integrated Discovery is a web-based
tool for annotating and analyzing gene lists (Dennis, Jr. et al. 2003).
Ref: david.niaid.nih.gov
GOTM: The GO Tree Machine is a web-based platform for interpreting microarray data
or other interesting gene sets using GO (Zhang et al. 2004).
Ref: genereg.ornl.gov/gotm
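Several of the tools above look for GO terms that are over-represented in a gene list. One common way to quantify this is a hypergeometric upper-tail probability. The Python sketch below illustrates that general idea only; it is an assumption that this statistic is suitable here and it is not claimed to be what any of the listed tools computes. The function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the reference set (e.g. all genes on the array)
    K: reference genes annotated with the GO term
    n: genes in the list of interest
    k: genes in that list annotated with the GO term
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probabilities of seeing k, k+1, ..., upper annotated genes.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical numbers: 20 genes in total, 5 carry the term,
# and 3 of the 4 selected genes carry it.
p = hypergeom_upper_tail(k=3, n=4, K=5, N=20)
print(f"p = {p:.4f}")
```

A small p suggests that the term occurs in the list more often than chance alone would explain. In practice many terms are tested at once, so a multiple-testing correction would also be needed.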
ONTOLOGIES OF BIOINFORMATICS ONTOLOGIES
With the proliferation of biological ontologies and databases, the ontologies themselves
need to be organized and classified.
OBO: The Open Biological Ontologies project seeks to collect ontologies for the domains of
genomics and proteomics. These ontologies are required to be open and to use either GO or OWL syntax.
Ref: www.obo.sourceforge.net
Example: an ontology in OBO describing zygote development from the one-cell stage to the two-cell stage

$ structurers.goff; ZFIN:0000000
<001_Zygote\:1-cell\,embryo; ZFIN:0000004
<001_Zygote\:1-cell\,blastomere; ZFIN:0000001
<001_Zygote\:1-cell\,Yolk; ZFIN:0000012
<001_Zygote\:1-cell\,extraembryonic; ZFIN:0000005
<001_Zygote\:1-cell\,chorion; ZFIN:0000002
<002_Cleavage\:2-cell\,embryo; ZFIN:0000017
<002_Cleavage\:2-cell\,blastomeres; ZFIN:0000013
<002_Cleavage\:2-cell\,Yolk; ZFIN:0000025
<002_Cleavage\:2-cell\,extraembryonic; ZFIN:0000018
<002_Cleavage\:2-cell\,chorion;ZFIN:0000014

TAMBIS: TAMBIS is a project that aims to help researchers in biological science by
building a homogenizing layer on top of various biological information services.
Ref: img.cs.man.ac.uk/tambis
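The OBO example earlier in this section uses a line-oriented syntax: a relationship marker, a term name with backslash-escaped punctuation, and an identifier after a semicolon. The Python sketch below reads such lines under the assumption that they follow the GO flat-file convention, in which `$` marks the root, `%` an is-a relationship, and `<` a part-of relationship. The function name parse_line is hypothetical, and the indentation that normally encodes the hierarchy is ignored because it is not preserved in the example.

```python
import re

# Relationship markers, assuming the GO flat-file convention.
MARKERS = {"$": "root", "%": "is_a", "<": "part_of"}


def parse_line(line):
    """Return (relation, term name, identifier), or None if the line does not fit."""
    line = line.strip()
    if not line or line[0] not in MARKERS or ";" not in line:
        return None
    relation = MARKERS[line[0]]
    # The identifier follows the last semicolon on the line.
    name, _, term_id = line[1:].rpartition(";")
    # Remove the backslash escapes, e.g. "\:" becomes ":".
    name = re.sub(r"\\(.)", r"\1", name).strip()
    return relation, name, term_id.strip()


sample = r"<001_Zygote\:1-cell\,embryo; ZFIN:0000004"
print(parse_line(sample))
```

Applied to every line of the fragment, this yields one record per term, which could then be loaded into a table or a graph structure.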

ONTOLOGY LANGUAGES IN BIOINFORMATICS


This section covers the main XML-based ontologies that have been developed for bioinformatics. The
number of such ontologies is large and continually increasing, so only some of them
are listed below.
BSML: The Bioinformatics Sequence Markup Language (BSML) is a language that
encodes biological sequence information, including graphical representations of
biologically meaningful objects such as nucleotide or protein sequences.
Ref: www.bsml.org
BioML: The Biopolymer Markup Language provides an extensible framework for
annotating experimental information about molecular entities such as proteins and genes.
Ref: www.rdcormia.com/COIN78/files/XML_Finals/BIOML/Pages/BIOML.htm
SBML: The Systems Biology Markup Language is an XML-based language for storing
biochemical models (Hucka et al. 2003).
Ref: www.sbw-sbml.org
MAGE-ML: The MicroArray Gene Expression Markup Language is an XML ontology
for microarray data. MAGE-ML aims to create a common data format so that data can be
shared easily between projects (Stoeckert, Jr. et al. 2002).
Ref: www.mged.org
CellML: The CellML ontology is being developed by Physiome Sciences Inc. The purpose
of CellML is to store and exchange computer-based biological models.
Ref: www.cellml.org
RNAML: This language provides a standard syntax that allows for the storage and exchange of
information about RNA sequences as well as secondary and tertiary structures.
Ref: www.1bit.iro.umontreal.ca/rnaml
AGAVE: The Architecture for Genomic Annotation, Visualization and Exchange is an
XML language created by DoubleTwist, Inc., for representing genomic annotation data.
Ref: www.animorphics.net/lifesci.html
CML: The purpose of the Chemical Markup Language (CML) is to manage chemical information. CML is supported by
tools such as the popular Jumbo browser.
Ref: www.xml-cml.org
GAME: GAME is an XML language for curation of DNA, RNA, or protein
sequences. GAME uses an XML DTD to specify the syntactic structure of the content of a
GAME document.
Ref: www.fruitfly.org/comparative
NeuroML: The Neural Open Markup Language is an XML language for describing
models, methods, and literature for neuroscience (Goddard et al. 2002).
Ref: www.neuroml.org/main.html
TML: The Taxonomical Markup Language is mainly an XML format for
representing the topology of a phylogeny, but it also provides a representation for the statistical metadata
describing it (Gilmour 2000).
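Because all of the languages above are XML-based, documents written in them can be read with a generic XML parser. The Python sketch below illustrates this with a hand-written, SBML-style fragment. The fragment is not taken from a real model; the model and species names are hypothetical, and the namespace URI and element names are assumed to follow SBML Level 2.

```python
import xml.etree.ElementTree as ET

# Hand-written, SBML-style fragment for illustration only.
DOC = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="S1" name="glucose" compartment="cell"/>
      <species id="S2" name="pyruvate" compartment="cell"/>
    </listOfSpecies>
  </model>
</sbml>
"""

# Namespace prefix used in the search below (assumed SBML Level 2 URI).
NS = {"sbml": "http://www.sbml.org/sbml/level2"}


def list_species(xml_text):
    """Return (id, name) pairs for every species element in the document."""
    root = ET.fromstring(xml_text.strip())
    return [(s.get("id"), s.get("name"))
            for s in root.findall(".//sbml:species", NS)]


print(list_species(DOC))
```

The same approach applies to the other languages in this list, with the namespace and element names taken from the corresponding DTD or schema.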
NUCLEOTIDE SEQUENCE DATABASES
GenBank: GenBank is a comprehensive database that contains publicly available DNA
sequences for more than 140,000 named organisms. The sequences are primarily obtained
through submissions from individual laboratories and batch submissions from large-scale
sequencing projects (Benson et al. 2004).
Ref: www.ncbi.nlm.nih.gov/Genbank
EMBL:
The EMBL Nucleotide Sequence Database, maintained at the European Bioinformatics
Institute (EBI), incorporates, organizes, and distributes nucleotide sequences from public
sources (Kulikova et al. 2004). The database is part of an international collaboration with
DDBJ and GenBank.
Ref: www.ebi.ac.uk/embl
DDBJ:
The DNA Data Bank of Japan (DDBJ) is maintained at the National Institute of Genetics in Japan. It is available in several
formats, including FASTA and XML. The XML format is defined by the DTD at
ftp://ftp.ddbj.nig.ac.jp/database/ddbj.xml/DDBJXML.dtd.
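FASTA, mentioned above as one of the available formats, is a simple text format: each record starts with a header line beginning with ">", followed by one or more lines of sequence. The Python sketch below reads such text into a dictionary. The function name parse_fasta and the sample records are hypothetical, and the sketch assumes that the first word of each header is a unique identifier.

```python
def parse_fasta(text):
    """Return {identifier: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split(maxsplit=1)
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            # Sequence lines are joined into a single string.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


sample = """>seq1 hypothetical example
ACGT
TTGA
>seq2
GGCC
"""
print(parse_fasta(sample))
```

For real work, an established library such as Biopython would usually be preferable to a hand-written parser.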
PROTEIN SEQUENCE DATABASE
SWISS-PROT:
SWISS-PROT is the most widely used publicly available protein sequence database. This
database aims to be nonredundant, fully annotated, and highly cross-referenced (Jung et
al. 2001). The XML format is defined both as a DTD and using XSD.
Ref: www.au.expasy.org/sprot/
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz

NUCLEOTIDE STRUCTURE DATABASES


NDB: The most prominent nucleotide structure database is the Nucleic Acid Database.
NDB was established in 1991 as a resource to assemble and distribute structural information
about nucleic acids (Berman et al. 1992).
Ref: www.ndbserver.rutgers.edu
PROTEIN STRUCTURE DATABASE
Protein structure databases deal with progressively ‘higher-order’ types of structure:
secondary, tertiary, quaternary, and functional.
Pfam: The Protein Family database is a large collection of protein families and
domains (Bateman et al. 2004). The Pfam database is available in the FASTA format.
Ref: www.sanger.ac.uk/software/Pfam
SMART: The Simple Modular Architecture Research Tool is a web tool for the
identification and annotation of protein domains, and provides a platform for the
comparative study of complex domain architectures in genes and proteins.
Ref: www.smart.embl.de
PROSITE:
PROSITE is a compilation of sites and patterns found in protein sequences (Sigrist et
al. 2002; Hulo et al. 2004). The use of protein sequence patterns to determine protein
function has become one of the essential tools in sequence analysis. PROSITE is closely
related to the SWISS-PROT protein sequence databank.
Ref: www.expasy.org/prosite
BLOCKS:
Blocks are defined as ungapped multiple alignments corresponding to the most conserved
regions of proteins. Blocks contain “multiple alignment” information, and the use of the
BLOCKS database can improve the detection of sequence similarities in searches of
sequence databases. The BLOCKS database contains more than 24,294 blocks from nearly
5000 different protein groups (Henikoff et al. 2000).
Ref: www.blocks.fhcrc.org
COG: The database of Clusters of Orthologous Groups of proteins (COGs) attempts to give a
phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria,
archaea, and eukaryotes (Tatusov et al. 2000).
Ref: www.ncbi.nlm.nih.gov/COG
PRINTS:
PRINTS is a compendium of protein fingerprints (Attwood et al. 1999, 2003). It is also
available in the FASTA format.
Ref: www.umber.sbs.man.ac.uk/dbbrowser/PRINTS
ProDom: ProDom is a comprehensive set of protein domain families automatically
generated from the SWISS-PROT and TrEMBL sequence databases (Servant et al. 2002).
Ref: http://protein.toulouse.inra.fr/prodom/current/html/home.php
PDB: The Protein Data Bank is the largest source of publicly available biomolecular 3D
structures (Bateman et al. 2004). PDB was established at Brookhaven National
Laboratories (BNL) in 1971 as an archive for biological macromolecular crystal structures.
The PDB database has two non-XML formats, PDB and mmCIF, that are in use by many
other structure databases. The current XML schema file is located at the first address below.
Ref: www.sit.pdb.org/pdbml/pdbx-vxsd, www.rcsb.org/pdb
SCOP: The Structural Classification of Proteins database classifies proteins by domains
that have a common ancestor based on sequence, structural, and functional
evidence (Murzin et al. 1995).
DIP: The Database of Interacting Proteins is a research tool for studying cellular networks
of protein interactions (Salwinski et al. 2004).
Ref: www.dip.doe-mbi.ucla.edu

MINT: The Molecular INTeraction database is a relational database containing interaction
data between biological molecules (Zanzoni et al. 2002).
Ref: http://160.80.34.4/mint
HPID: The Human Protein Interaction Database was designed to provide human protein
interaction data precomputed from existing structural and experimental data using
appropriate methods.
Ref: http://wilab.inha.ac.kr/hpid/
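The PROSITE entry in this section notes that sequence patterns are used to infer protein function. The Python sketch below translates a subset of the PROSITE pattern notation into a regular expression and searches a sequence with it. Only the common elements are handled (single residues, x, [..], {..}, repetition counts, and the "<" and ">" anchors at the ends of a pattern). The function name prosite_to_regex and the test sequence are hypothetical.

```python
import re

# One pattern element: optional "<", a residue / x / [..] / {..},
# an optional repetition such as (3) or (2,4), and an optional ">".
ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                       # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {..} lists excluded residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


# N-glycosylation site pattern, searched in a hypothetical sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "AANGSAA")
print(regex, match.start() if match else None)
```

A production scanner would also need to handle overlapping matches and the remaining parts of the pattern syntax.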

TRANSCRIPTION FACTOR DATABASES


TRANSFAC: The most complete transcription factor database is
TRANSFAC (Wingender et al. 1996). This database is concerned with eukaryotic
transcription regulation.
Ref: www.transfac.gbf.de
COMPEL:
COMPEL is a database of composite regulatory elements, the basic structures of
combinatorial regulation. Access to COMPEL requires registration, but it is free for
noncommercial use.
Ref: www.comple.bionet.nsc.ru
SPECIES – SPECIFIC DATABASES
SGD: The Saccharomyces Genome Database is a database of the molecular biology and
genetics of the budding yeast Saccharomyces cerevisiae.
Ref: www.yeastgenome.org
FlyBase:
The fruit fly, Drosophila melanogaster, is one of the most studied eukaryotic organisms
and a central model for the Human Genome Project (FlyBase 2002). FlyBase is the database
of Drosophila genetic and genomic information.
Ref: www.flybase.bio.indiana.edu
MGD: The Mouse Genome Database, at the Jackson Laboratory in Bar Harbor, Maine, is a
resource for mouse genome information.
Ref: www.informatics.jax.org

SPECIALIZED PROTEIN DATABASES


ORDB: The Olfactory Receptor Database is a central repository of olfactory receptor (OR)
and olfactory receptor-like gene and protein sequences (Crasto et al. 2002). Humans detect
odorants through ORs, which are located on the olfactory sensory neurons in the olfactory
epithelium of the nose (Buck and Axel 1991; Buck200).
Ref: www.senselab.med.yale.edu/senselab/ordb
RiboWeb:
RiboWeb is a relational database containing a representation of the primary 3D data
relevant to the structure of the prokaryotic 30S ribosomal subunit, which
initiates the translation of messenger RNA into protein and is the site of action of numerous
antibiotics (Chen et al. 1997).
Ref: www.smi-web.stanford.edu/projects/helix/riboweb.html
TRANSCRIPTOMICS DATABASES
It is useful to study the temporal and spatial patterns of gene expression. Transcriptomics
is defined as the use of quantitative mRNA measurements of gene expression to
characterize biological processes and elucidate gene transcription mechanisms.

S.No.  Name of the database                        Web address

1.     NCBI’s dbEST Database                       www.ncbi.nlm.nih.gov/dbEST/
2.     The GeneCards Database                      bioinformatics.weizmann.ac.il/cards
3.     Kidney Development Gene Expression Database organogenesis.ucsd.edu
4.     Gene Expression in Tooth                    bite-it.helsinki.fi
5.     Mouse Gene Expression Database              www.informatics.jax.org
6.     The Cardiac Gene Expression Knowledgebase   www.cage.wbmei.jhu.edu
7.     Cancer Gene Expression Omnibus              www.cage.wbmei.jhu.edu
8.     Saccharomyces Genome Database               www.yeastgenome.org
9.     The Nematode Expression Pattern Database    Nematode.lab.nig.ac.jp
10.    NCBI’s Gene Expression Omnibus              www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=geo

PROTEOMIC DATABASE
Proteomics is defined as the use of quantitative protein-level measurements of gene
expression to characterize biological processes and elucidate the mechanisms of gene
translation. There are generally two steps in proteomics: protein separation and
identification.

S.No.  Name of the database                   Web address

1.     HEART-2DPAGE                           userpage.chemie.fu-berlin.de/~pleiss/dhzb.html
2.     Heart High-Performance 2-DE Database   www.mdc-berlin.de/~emu/heart
3.     SWISS-2DPAGE                           au.expasy.org/ch2d
4.     REPRODUCTION-2DPAGE                    www.reprod.njmu.edu.cn/cgi-bin/2d/2d.cgi
5.     Fishprom                               www.abdn.ac.uk/fishprom/index.shtml
PATHWAY DATABASE
A pathway is a system of molecules (especially proteins) that work together. Pathways are
also called molecular interaction networks.

BioPAX:
BioPAX is a collaborative effort to create a data exchange format for biological pathway
data. The current format is called BioPAX Level 1 and represents metabolic pathway
information.
Ref: www.biopax.org/
KEGG:
The Kyoto Encyclopedia of Genes and Genomes (Kanehisa and Goto 2000; Kanehisa et
al. 2002) is the primary database resource of the Japanese GenomeNet service for
understanding higher-order functional meanings and utilities of the cell or the organism
from its genome information.
Ref: www.genome.ad.jp/kegg
EcoCyc:
EcoCyc is an organism-specific pathway database that describes the metabolic and signal
transduction pathways of E. coli K12 MG1655, its enzymes, and its transport proteins (Karp
et al. 2002c).
Ref: www.ecocyc.org

References:

• Bioinformatics by Kenneth Baclawski & Tianhua Niu.
• Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2004.
  GenBank: update. Nucleic Acids Res. 32:D23-D26. Database issue.
• Bateman, A., L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna,
  M. Marshall, S. Moxon, E.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy. 2004.
  The Pfam protein families database. Nucleic Acids Res. 32:D138-D141. Database issue.
• Murzin, A.G., S.E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural
  classification of proteins database for the investigation of sequences and structures.
  J. Mol. Biol. 247:536-540.
• http://www.geocities.com/bioinformaticsweb/datalink.html
• http://www.clcbio.com/index.php?id=502
