The databases EMBL, GenBank, and DDBJ are the three primary
nucleotide sequence databases. They include sequences submitted
directly by scientists and genome sequencing groups, and sequences
taken from literature and patents. There is comparatively little error
checking and there is a fair amount of redundancy.
The entries in the EMBL, GenBank and DDBJ databases
are synchronized on a daily basis, and the accession numbers are
managed in a consistent manner between these three centers.
The nucleotide databases have reached such large sizes that they are
available in subdivisions that allow searches or downloads that are
more limited, and hence less time-consuming. For example, GenBank
currently has 17 divisions.
There are no legal restrictions on the use of the data in these
databases. However, there are patented sequences in the databases.
EMBL www.ebi.ac.uk/embl/
The EMBL (European Molecular Biology Laboratory) nucleotide
sequence database is maintained by the European Bioinformatics
Institute (EBI) in Hinxton, Cambridge, UK. As of 16 Jan 2001, it
Protein Sequence
The two protein sequence databases SWISS-PROT and PIR are
different from the nucleotide databases in that they are both curated.
This means that groups of designated curators (scientists) prepare the
entries from literature and/or contacts with external experts.
SWISS-PROT, TrEMBL www.expasy.ch/sprot/
SWISS-PROT is a protein sequence database which strives to provide
a high level of annotation (such as the description of the function of
a protein, its domain structure, post-translational modifications,
variants, etc.), a minimal level of redundancy and a high level of
integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical
Biochemistry at the University of Geneva. This database is generally
considered one of the best protein sequence databases in terms of the
quality of the annotation. Release 39.12 (11 Jan 2001) contained
92,211 entries.
TrEMBL is a computer-annotated supplement of SWISS-PROT
that contains all the translations of EMBL nucleotide sequence entries
PIR grew out of Margaret Dayhoff's work in the middle of the 1960s.
It strives to be comprehensive, well-organized, accurate, and
consistently annotated. However, it is generally believed that its
entry annotation does not reach the level of completeness found in
SWISS-PROT. Although SWISS-PROT and PIR overlap extensively,
there are still many sequences which can be found in only one of
them.
One can search for entries or do sequence similarity searches at the
PIR site. The database can also be downloaded as a set of files. An
example of what an entry looks like is given for the human raf-1
oncogene protein, ID TVHUF6.
PIR also produces NRL-3D, which is a database of sequences
extracted from the three-dimensional structures in the Protein
Databank (PDB) (see also the following page in this lecture). The
NRL-3D database makes the sequence information in PDB available
for similarity searches and retrieval and provides cross-reference
information for use with the other PIR Protein Sequence Databases.
Structure databases.
Pfam www.sanger.ac.uk/Software/Pfam/, www.cgr.ki.se/Pfam/
Pfam is a database of protein families defined as
domains (contiguous segments of entire protein sequences). For each
Secondary Databases:
Data that are derived from the analysis or treatment of primary data, such as
secondary structures, hydrophobicity plots, and domains, are stored in secondary
databases.
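As an illustration of how derived data such as a hydrophobicity plot can be computed from a primary sequence, here is a minimal sketch of a sliding-window hydropathy profile. It assumes the Kyte-Doolittle scale and a window of 9 residues; neither choice comes from these notes, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
# (the choice of this scale is an assumption made for the example).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence."""
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    # A KeyError here means the sequence contains a non-standard residue code.
    scores = [KYTE_DOOLITTLE[residue] for residue in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Example: one averaged value per window position, ready to be plotted.
profile = hydropathy_profile("MKTLLVAGAVLLSACSS", window=5)
```

Plotting the returned values against the window position gives the hydrophobicity plot; a secondary database would store such derived profiles rather than the raw sequence alone.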
composite databases
... collection of all publicly ... maintained by the ... Center for Biotechnology Information (NCBI) as part of the International ... entries are retrievable by Entrez or downloadable by FTP.
EMBL,
The EMBL Nucleotide Sequence Database ...
DDBJ,
The DNA Data Bank of Japan (DDBJ) is a biological
database that collects DNA sequences.
DDBJ Center collects nucleotide sequence data as a
member of INSDC (International Nucleotide Sequence
Database Collaboration) and provides freely available
nucleotide sequence data and a supercomputer system to
support research activities in life science.
It exchanges its data with the European Molecular Biology
Laboratory at the European Bioinformatics Institute and
with GenBank at the National Center for Biotechnology
Information on a daily basis. Thus these three databanks
contain essentially the same data at any given time.
DDBJ began data bank activities in 1986 at NIG and
remains the only nucleotide sequence data bank in Asia.
Although DDBJ mainly receives its data from Japanese
researchers, it can accept data from contributors from
any other country.
... Culture, Sports, Science and Technology (MEXT).
The principal purpose of DDBJ operations is to improve
the quality of INSD as a public-domain resource.
SWISSPROT
UniProtKB is a comprehensive protein sequence knowledgebase that consists of two sections: UniProtKB/Swiss-Prot, which contains manually ... information ... structure, post-translational modifications, ... and TrEMBL.
UniProtKB/TrEMBL contains high-quality ... manual annotation process of ... from PDB, and from gene prediction, ...
Secondary databases
PROSITE,
PROSITE is a protein database.[1][2] It consists of entries
describing the protein families, domains and functional
sites as well as amino acid patterns and profiles in them.
It is based on the observation that, while there is a huge
number of different proteins, most of them can be
grouped, on the basis of similarities in their sequences,
into a limited number of families.
Proteins or protein domains belonging to a particular
family generally share functional attributes and are
derived from a common ancestor.
PROSITE currently contains patterns and profiles
specific for more than a thousand protein families or
domains.
Each of these signatures comes with documentation
providing background information on the structure and
function of these proteins.
The ProRule section of PROSITE consists of
manually created rules that can automatically generate
annotation in the UniProtKB/Swiss-Prot format based on
PROSITE motifs.
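To make the idea of an amino acid pattern concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression. It assumes the usual pattern conventions (elements separated by "-", "x" for any residue, "[..]" for allowed residues, "{..}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" for the sequence ends); the function name is hypothetical, and rarer syntax such as ">" inside square brackets is not handled. Profiles are not covered.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):        # repeat count such as (3) or (2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # excluded residues
            core = "[^" + element[1:-1] + "]"
        else:                            # single residue or [allowed residues]
            core = element
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


# Example pattern in this notation: N, then anything but P, then S or T,
# then anything but P.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(regex, "AANGSAA")
```

Scanning a new sequence with the resulting expression is a simple stand-in for a pattern search; sequences are assumed to use upper-case one-letter codes.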
PRINTS,
The PRINTS database is a collection of so-called "fingerprints";
it provides both a detailed annotation resource for protein
families, and a diagnostic tool for newly determined
sequences.
A fingerprint is a group of conserved motifs taken from
a multiple sequence alignment - together, the motifs form
a characteristic signature for the aligned protein family.
The motifs themselves are not necessarily contiguous in
sequence, but may come together in 3D space to define
molecular binding sites or interaction surfaces.
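The notion of a fingerprint as several separated motifs that must all be present can be sketched as follows. This is a deliberate simplification: each motif is written as a regular expression, the motifs are required to occur in order without overlapping, and the motif strings and function name are invented for illustration. It is not how PRINTS itself scores matches.

```python
import re


def match_fingerprint(sequence, motifs):
    """Return the start position of each motif if all occur in order, else None."""
    positions = []
    search_from = 0
    for motif in motifs:
        # Look for the next motif only after the end of the previous hit,
        # so motifs may be far apart in sequence but cannot overlap.
        hit = re.compile(motif).search(sequence, search_from)
        if hit is None:
            return None
        positions.append(hit.start())
        search_from = hit.end()
    return positions


# Hypothetical two-motif fingerprint and a made-up test sequence.
fingerprint = ["GDS[LV]", "C.Y"]
positions = match_fingerprint("AAGDSLKKKCPYTTT", fingerprint)
```

A sequence that contains only some of the motifs, or contains them in a different order, returns None, which mirrors the idea that the motifs act together as one diagnostic signature.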
Pfam,
Pfam is a database of protein families that includes their
annotations and multiple sequence alignments generated
using hidden Markov models.
For each family in Pfam one can:
Look at multiple alignments
BLOCKS,
The Blocks Database contains multiple alignments of
conserved regions in protein families.
BioGRID
The Biological General Repository for Interaction
Datasets (BioGRID) is a curated biological database of
protein-protein and genetic interactions created in 2003 ...
It strives to provide a comprehensive resource for all ...
attempting to ...
and InterPro.
InterPro provides functional analysis of proteins by
classifying them into families and predicting domains
and important sites.
InterPro is a database of protein families, domains and
functional sites in which identifiable features found in
known proteins can be applied to new protein
sequences[1] in order to functionally characterise them.
The contents of InterPro are based around diagnostic
signatures and the proteins that they significantly match.
The signatures consist of models (simple types, such
as regular expressions, or more complex ones, such
as Hidden Markov models) which describe protein
families, domains or sites.
Models are built from the amino acid sequences of
known families or domains and they are subsequently
used to search unknown sequences (such as those arising
from novel genome sequencing) in order to classify
them.
... classifications (SUPERFAMILY and CATH-Gene3D) through to quite
specific sub-family ... an InterPro accession number, for
instance IPR000001.
InterPro aims to release data to the public every 8 weeks,
typically within a day of the UniProtKB release of the
same proteins.
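When handling InterPro entries in scripts it can be useful to check that a string has the shape of an accession number before using it. The sketch below assumes the format suggested by the example IPR000001, namely the prefix "IPR" followed by exactly six digits; the function name is hypothetical.

```python
import re

# Assumed accession shape: "IPR" followed by exactly six digits.
INTERPRO_ACCESSION = re.compile(r"IPR\d{6}")


def is_interpro_accession(text):
    """Return True if text looks like an InterPro accession number."""
    return INTERPRO_ACCESSION.fullmatch(text) is not None


looks_valid = is_interpro_accession("IPR000001")
```

This only checks the form of the identifier; it says nothing about whether the entry actually exists in the database.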
Structure databases
PDB,
The Protein Data Bank (PDB) is a repository for the
three-dimensional structural data of large biological
molecules, such as proteins and nucleic acids. (See
also crystallographic database.)
The data, typically obtained by X-ray crystallography or NMR
spectroscopy and submitted ... structures ... source programs
include ICM-Browser, Swiss-PDB Viewer, ...
SCOP,
The Structural Classification of Proteins (SCOP)
database is a largely manual classification of protein structural
domains based on similarities of ... provides detailed and
comprehensive ... functional similarity are placed in different ...
CATH,
The CATH Protein Structure Classification is a semiautomatic, hierarchical classification of protein domains.
CATH shares many broad features with its principal
rival, SCOP; however, there are also many areas in which
the detailed classification differs greatly.
Only ... family) and Homologous superfamily ...
1 Class
2 Architecture
3 Topology
4 Homologous superfamily
EuroCarbDB,
EuroCarbDB is an EU-funded initiative for the creation
of software and standards for the systematic collection
of carbohydrate structures and their experimental data.
The EUROCarbDB project is a design study for a
technical framework, which provides sophisticated,
freely accessible, open-source informatics tools and
databases to support glycobiology and glycomic
research.
... is complemented by a suite of ... for carbohydrate ...
EuroCarbDB is ... hosted by ... developed open source project.
PubChem Compound,
PubChem is a database of chemical molecules and their
activities against biological assays. The system is
maintained by the National Center for Biotechnology
Information (NCBI), a component of the National Library
of Medicine, which is part of the United States National
Institutes of Health (NIH).
PubChem can be accessed for free through a web user
interface.
Millions of compound structures and descriptive datasets
can be freely downloaded via FTP.
PubChem contains substance descriptions and small
molecules with fewer than 1000 atoms and 1000 bonds.
More than 80 database vendors contribute to the growing
PubChem database.
PubChem consists of three dynamically growing primary
databases. As of 7 January 2011:
Compounds,
Substances,
BioAssay,
PubChem Compound (4) is a searchable database of
chemical structures with validated chemical depiction
information provided to describe substances in PubChem
Substance.
Structures stored within PubChem Compounds are pre-clustered
and cross-referenced by identity and similarity
groups.
PubChem Compound includes over 5M compounds.
... allow searching with specific element restrictions.
DrugBank,
The DrugBank database[1] is a comprehensive, high-quality,
freely accessible, online database containing
information on drugs and drug targets.
As both a bioinformatics and ... students and the
general public.
Its extensive drug and drug-target data has enabled the
discovery and repurposing of a number of existing drugs
to treat rare and newly identified illnesses.
... 155 FDA-approved biotech (protein/peptide) drugs,
87 nutraceuticals and over 6000 experimental drugs.
Each DrugCard entry contains more than 200 data fields
with half of the information being devoted to
drug/chemical data and the other half devoted to drug
target or protein data.
All data in DrugBank is non-proprietary or is derived
from a non-proprietary source. It is freely accessible and
available to anyone.
In addition, nearly every data item is fully traceable and
explicitly referenced to the original source. DrugBank
data is available through a public web interface and
downloads.
ChemSpider
ChemSpider is a chemical database owned by the Royal
Society of Chemistry.
The database contains more than 30 million ... systematic ...
formula and molecular ... apps for ... in a variety of software such ...
is also possible to calculate what the ... new structures each year) and with ...