
Nucleotide Sequence

The databases EMBL, GenBank, and DDBJ are the three primary
nucleotide sequence databases. They include sequences submitted
directly by scientists and genome sequencing groups, and sequences
taken from literature and patents. There is comparatively little error
checking and there is a fair amount of redundancy.
The entries in the EMBL, GenBank and DDBJ databases
are synchronized on a daily basis, and the accession numbers are
managed in a consistent manner between these three centers.
The nucleotide databases have reached such large sizes that they are
available in subdivisions that allow searches or downloads that are
more limited, and hence less time-consuming. For example, GenBank
currently has 17 divisions.
There are no legal restrictions on the use of the data in these
databases. However, there are patented sequences in the databases.
EMBL www.ebi.ac.uk/embl/
The EMBL (European Molecular Biology Laboratory) nucleotide
sequence database is maintained by the European Bioinformatics
Institute (EBI) in Hinxton, Cambridge, UK. As of 16 Jan 2001, it
contained 10,378,022 records with a total of 11,302,156,937 bases;
see the EMBL DB statistics page.
It can be accessed and searched through the SRS system at EBI, or
one can download the entire database as flat files. An example of what
an entry looks like is given for the human raf oncogene protein, ID:
HSRAFR.
GenBank www.ncbi.nlm.nih.gov/Genbank/
The GenBank nucleotide database is maintained by the National
Center for Biotechnology Information (NCBI), which is part of the
National Institutes of Health (NIH), a federal agency of the US
government.
It can be accessed and searched through the Entrez system at NCBI,
or one can download the entire database as flat files. An example of
what an entry looks like is given for the human raf oncogene protein,
Locus: HSRAFR.
DDBJ www.ddbj.nig.ac.jp
The DNA Data Bank of Japan began as a collaboration with EMBL
and GenBank. It is run by the National Institute of Genetics. One can
search for entries by accession number, FASTA/BLAST, keywords
and regular expressions.
The following databases contain subsets of the EMBL/GenBank
databases. Some also contain more information or links than the
primary ones, or have a different organization of the data to better suit
some specific purpose. However, the nucleotide sequences themselves
should always be available in the EMBL/GenBank databases. In this
sense, the databases below are secondary databases.
UniGene www.ncbi.nlm.nih.gov/UniGene/
The UniGene system attempts to process the GenBank sequence data
into a non-redundant set of gene-oriented clusters. Each UniGene
cluster contains sequences that represent a unique gene, as well as
related information such as the tissue types in which the gene has
been expressed and map location.
SGD genome-www.stanford.edu/Saccharomyces/
The Saccharomyces Genome Database (SGD) is a scientific database
of the molecular biology and genetics of the yeast Saccharomyces
cerevisiae.
EBI Genomes www.ebi.ac.uk/genomes/
This web site provides access and statistics for the completed
genomes, and information about ongoing projects.
Genome Biology www.ncbi.nlm.nih.gov/Genomes/
The Genome Biology site at NCBI contains information about the
available complete genomes.
Ensembl www.ensembl.org

Ensembl is a joint project between EMBL-EBI and the Sanger Centre
to develop a software system which produces and maintains automatic
annotation of eukaryotic genomes.

Protein Sequence
The two protein sequence databases SWISS-PROT and PIR are
different from the nucleotide databases in that they are both curated.
This means that groups of designated curators (scientists) prepare the
entries from literature and/or contacts with external experts.
SWISS-PROT, TrEMBL www.expasy.ch/sprot/
SWISS-PROT is a protein sequence database which strives to provide
a high level of annotation (such as the description of the function of
a protein, its domain structure, post-translational modifications,
variants, etc.), a minimal level of redundancy and a high level of
integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical
Biochemistry at the University of Geneva. This database is generally
considered one of the best protein sequence databases in terms of the
quality of the annotation. Release 39.12 (11 Jan 2001) contained
92,211 entries.
TrEMBL is a computer-annotated supplement of SWISS-PROT
that contains all the translations of EMBL nucleotide sequence entries
not yet integrated in SWISS-PROT. The procedure that is used to
produce it was developed by Rolf Apweiler. Release 15.14 (5 Jan
2001) contained 378,152 entries. The annotation of an entry in
TrEMBL has not (yet) reached the standards required for inclusion
into SWISS-PROT proper.
SWISS-PROT and TrEMBL are developed by the SWISS-PROT
groups at the Swiss Institute of Bioinformatics (SIB) and at EBI. The
databases can be accessed and searched through the SRS system at
ExPASy, or one can download the entire database as one single flat
file. An example of what an entry looks like is given for the human raf
oncogene protein, ID KRAF_HUMAN.
The SWISS-PROT database has some legal restrictions: the entries
themselves are copyrighted, but freely accessible and usable by
academic researchers. Commercial companies must pay SIB a license
fee to use SWISS-PROT.
PIR pir.georgetown.edu
The Protein Information Resource (PIR) is a division of the National
Biomedical Research Foundation (NBRF) in the US. It is involved in
a collaboration with the Munich Information Center for Protein
Sequences (MIPS) and the Japan International Protein Information
Database (JIPID). Release 67.00 (31 Dec 2000) contains 198,801
entries.

PIR grew out of Margaret Dayhoff's work in the middle of the 1960s.
It strives to be comprehensive, well-organized, accurate, and
consistently annotated. However, it is generally believed that it does
not reach the level of completeness in the entry annotation as does
SWISS-PROT. Although SWISS-PROT and PIR overlap extensively,
there are still many sequences which can be found in only one of
them.
One can search for entries or do sequence similarity searches at the
PIR site. The database can also be downloaded as a set of files. An
example of what an entry looks like is given for the human raf-1
oncogene protein, ID TVHUF6.
PIR also produces NRL_3D, which is a database of sequences
extracted from the three-dimensional structures in the Protein
Data Bank (PDB) (see also the following page in this lecture). The
NRL_3D database makes the sequence information in PDB available
for similarity searches and retrieval and provides cross-reference
information for use with the other PIR Protein Sequence Databases.

Structure databases.
Pfam www.sanger.ac.uk/Software/Pfam/, www.cgr.ki.se/Pfam/
Pfam is a database of protein families defined as
domains (contiguous segments of entire protein sequences). For each
domain, it contains a multiple alignment of a set of defining
sequences (the seeds) and the other sequences in SWISS-PROT and
TrEMBL that can be matched to that alignment.
The database was started in 1996 and is maintained by a consortium
of scientists, among them Erik Sonnhammer (CGR, KI, Sweden),
Sean Eddy (WashU, St Louis USA), Richard Durbin, Alan Bateman
and Ewan Birney (Sanger Centre, UK). Release 5.5 (Sep 2000)
contains 2478 families.
The alignments can be converted into hidden Markov
models (HMM), which can be used to search for domains in a query
protein sequence. The software HMMER (by Sean Eddy) is the
computational foundation for Pfam. The domain structure of protein
sequences in SWISS-PROT and TrEMBL is available directly from
the Pfam web sites, and it is also possible to search for domains in
other sequences using servers at the web sites.
The technology behind Pfam/HMMER will be discussed in a lecture
later in this course.
The Pfam database can be searched, or used to identify domains in a
sequence, or downloaded from the websites above. An example of a
multiple sequence alignment that defines a protein family (domain) is
given for the Raf-like Ras-binding domain (Pfam name RBD,
accession code PF02196).

The Pfam database is licensed under the GNU General Public
License, which basically makes it available to anyone, but imposes
the restriction that derivative works (new databases, modifications)
must be made available in source form.
PROSITE www.expasy.ch/prosite/
PROSITE is a database of protein families and domains. It consists of
biologically significant sites, patterns and profiles that help to
reliably identify to which known protein family (if any) a new
sequence belongs.
It was started by Amos Bairoch, is part of SWISS-PROT and is
maintained in the same way as SWISS-PROT. It is based on
regular expressions describing characteristic subsequences of
specific protein families or domains. PROSITE has been extended to
also contain some profiles, which can be described as probability
patterns for specific protein sequence families.
Regular expressions will be described in a lecture later in this course.
The site above can be used to search by keyword or other text in the
entries, to search for a pattern in a sequence, or to search for proteins
in SWISS-PROT that match a pattern. An example of a PROSITE
regular expression is given for the Ras GTPase-activating proteins
signature pattern (RAS_GTPASE_ACTIV_1, accession code
PS00509).
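As a rough illustration of how a PROSITE-style pattern relates to an
ordinary regular expression, the sketch below converts the pattern
notation into Python regular expression syntax. It is a simplified
sketch, not the PROSITE implementation: the function name
prosite_to_regex is invented for this example, only the most common
notation elements are handled (x, [..], {..}, repeat counts and the
< and > anchors), and the example pattern N-{P}-[ST]-{P} is the
N-glycosylation site pattern rather than the Ras GAP signature
mentioned above. The query sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified: handles x, [ABC], {ABC}, (n) / (n,m) repeats and the
    '<' / '>' terminal anchors; other notation is not supported.
    """
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")

        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()

        if core == "x":                  # any residue
            part = "."
        elif core.startswith("{"):       # any residue except those listed
            part = "[^" + core[1:-1] + "]"
        else:                            # single residue or [ABC] set
            part = core

        if low is not None:              # (n) or (n,m) repetition
            part += "{" + low + ("," + high if high else "") + "}"

        regex_parts.append(("^" if at_start else "") + part + ("$" if at_end else ""))
    return "".join(regex_parts)


# Hypothetical query sequence, used only to exercise the function.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "AANGSAA")
print(regex, hit.start() if hit else None)
```

Keeping the conversion element by element mirrors the way the pattern
is written, so each dash-separated element maps to exactly one piece
of the resulting expression.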

Primary and Secondary databases.


Primary Databases:
Databases consisting of data derived experimentally, such as nucleotide sequences
and three-dimensional structures, are known as primary databases.
Primary databases have grown tremendously over the years.
A primary database contains information on the sequence or structure alone,
together with the associated annotation information.

Secondary Databases:
Data that are derived from the analysis or treatment of primary data, such as
secondary structures, hydrophobicity plots, and domains, are stored in secondary
databases.
A secondary database contains information derived from a primary database, such as
information about conserved sequences, signature sequences and active site residues of
protein families, arrived at by multiple sequence alignment of a set of related
proteins.
A secondary structure database contains entries of the PDB in an organized
way (for instance, by classification of all PDB entries according to structures
like alpha-helices or beta-sheets) and also information on conserved secondary
structure motifs of a particular protein.

Composite Databases:
A composite database joins a variety of different primary database sources,
which obviates the need to search multiple resources.

Primary databases

GenBank
The GenBank sequence database is an open access,
annotated collection of all publicly available
nucleotide sequences and their protein translations.
This database is produced and maintained by the National
Center for Biotechnology Information (NCBI) as part of
the International Nucleotide Sequence Database
Collaboration (INSDC).


The National Center for Biotechnology Information is a
part of the National Institutes of Health in the United
States.
GenBank and its collaborators receive sequences
produced in laboratories throughout the world from more
than 100,000 distinct organisms.

In the more than 30 years since its establishment,
GenBank has become the most important and most
influential database for research in almost all biological
fields, whose data are accessed and cited by millions of
researchers around the world.
GenBank is built by direct submissions from individual
laboratories, as well as from bulk submissions from
large-scale sequencing centers.
Only original sequences can be submitted to GenBank.
Direct submissions are made to GenBank using BankIt,
which is a Web-based form, or the stand-alone
submission program, Sequin.
Upon receipt of a sequence submission, the GenBank
staff examines the originality of the data, assigns
an accession number to the sequence and performs
quality assurance checks.
The submissions are then released to the public database,
where the entries are retrievable by Entrez or
downloadable by FTP.

EMBL
The EMBL Nucleotide Sequence Database
(http://www.ebi.ac.uk/embl), maintained at the European
Bioinformatics Institute (EBI) near Cambridge, UK, is a
comprehensive collection of nucleotide sequences and
annotation from available public sources.
The database is part of an international collaboration
with DDBJ (Japan) and GenBank (USA).
Data are exchanged daily between the collaborating
institutes.
Webin is the preferred tool for individual submissions of
nucleotide sequences, including Third Party Annotation
(TPA) and alignments.
Automated procedures are provided for submissions
from large-scale sequencing projects and data from the
European Patent Office.
New and updated data records are distributed daily and
the whole EMBL Nucleotide Sequence Database is
released four times a year.
Access to the sequence data is provided via ftp and
several WWW interfaces.
With the web-based Sequence Retrieval System (SRS) it
is also possible to link nucleotide data to other specialist
molecular biology databases maintained at the EBI.

Other tools are available for sequence similarity
searching (e.g. FASTA and BLAST).

DDBJ
The DNA Data Bank of Japan (DDBJ) is a biological
database that collects DNA sequences.
The DDBJ Center collects nucleotide sequence data as a
member of INSDC (International Nucleotide Sequence
Database Collaboration) and provides freely available
nucleotide sequence data and a supercomputer system to
support research activities in life science.
It exchanges its data with the European Molecular Biology
Laboratory at the European Bioinformatics Institute and
with GenBank at the National Center for Biotechnology
Information on a daily basis. Thus these three databanks
contain essentially the same data at any given time.
DDBJ began data bank activities in 1986 at NIG and
remains the only nucleotide sequence data bank in Asia.
Although DDBJ mainly receives its data from Japanese
researchers, it can accept data from contributors from
any other country.

DDBJ is primarily funded by the Japanese Ministry of
Education, Culture, Sports, Science and
Technology (MEXT).
The principal purpose of DDBJ operations is to improve
the quality of INSD as a public-domain resource.

SWISS-PROT
UniProtKB is a comprehensive protein sequence
knowledgebase that consists of two sections:
UniProtKB/Swiss-Prot, which contains manually
annotated entries, and UniProtKB/TrEMBL, which
contains computer annotated entries.
UniProtKB/Swiss-Prot entries contain information
curated by biologists and provide users with cross-links
to about 100 external databases and with access to
additional information or tools.
SWISS-PROT is a curated protein sequence database
which strives to provide a high level of annotation (such
as the description of the function of a protein, its
domain structure, post-translational modifications,
variants, etc.), a minimal level of redundancy and a high
level of integration with other databases.
The SWISS-PROT protein sequence database consists of
sequence entries. Sequence entries are composed of
different line types, each with its own format. For
standardization purposes the format of SWISS-PROT
follows as closely as possible that of the EMBL
Nucleotide Sequence Database.
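To make the idea of line types concrete, the sketch below groups the
lines of one flat-file entry by their two-letter line code. It assumes
the layout used by EMBL and SWISS-PROT flat files, in which each line
starts with a two-character code followed by spaces, sequence data
lines start with blanks, and an entry ends with a line containing
only //. The function name and the shortened sample entry are invented
for illustration and are not a real database record.

```python
from collections import defaultdict


def group_by_line_code(entry_text: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        if not line.strip():           # skip empty lines
            continue
        # The first two characters are the line code; sequence data
        # lines begin with blanks, so their code is two spaces.
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


# Made-up, heavily shortened entry; not copied from any database release.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   12 AA.
AC   X00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
SQ   SEQUENCE   12 AA;
     MEHIQGAWKTIS
//
"""

fields = group_by_line_code(SAMPLE_ENTRY)
print(sorted(fields))   # line codes present in the entry
print(fields["AC"])     # content of the accession line(s)
```

Because the nucleotide and protein flat files share this layout, the
same grouping step would apply to either kind of entry under the
assumptions stated above.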
The SWISS-PROT database distinguishes itself from
other protein sequence databases by four distinct
criteria:
a) Annotation
b) Minimal redundancy
c) Integration with other databases
d) Documentation

TrEMBL
UniProtKB/TrEMBL contains high-quality
computationally analyzed records, which are enriched
with automatic annotation.
It was introduced in response to increased dataflow
resulting from genome projects, as the time- and
labour-consuming manual annotation process of
UniProtKB/Swiss-Prot could not be broadened to include
all available protein sequences.

The translations of annotated coding sequences in
the EMBL-Bank/GenBank/DDBJ nucleotide sequence
database are automatically processed and entered in
UniProtKB/TrEMBL. UniProtKB/TrEMBL also contains
sequences from PDB, and from gene prediction,
including Ensembl, RefSeq and CCDS.
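The translation step mentioned above can be sketched in a few lines.
This is only an illustration under stated assumptions, not the
UniProt pipeline: the function name translate_cds is invented, the
standard genetic code is assumed, and the input is taken to be a
coding sequence that starts in frame. Unknown codons are reported
as X.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")      # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):      # step through complete codons
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":                # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up coding sequence used only as an example.
print(translate_cds("ATGGCCTAA"))
```

Real annotated coding sequences can use other genetic codes or carry
exceptions, which is one reason an automated pipeline needs more than
this table lookup.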


Due to the nature of the source, UniProtKB/TrEMBL is
highly redundant and the quality of the annotation is very
variable. In addition to the original annotations carried over
from EMBL-Bank, further annotations are added
based on a series of automated annotation workflows.
As the entries in UniProtKB/TrEMBL are manually
reviewed by the UniProt curators, they graduate into
UniProtKB/Swiss-Prot (the human curated section of
UniProtKB) and may be merged into existing entries
which describe the same gene in the same species.
The usual Swiss-Prot annotation pipeline involves the
manual annotation of TrEMBL entries, their integration
into Swiss-Prot, with their original accession number,
and subsequent deletion from TrEMBL.

Secondary databases

PROSITE
PROSITE is a protein database.[1][2] It consists of entries
describing the protein families, domains and functional
sites as well as amino acid patterns and profiles in them.
It is based on the observation that, while there is a huge
number of different proteins, most of them can be
grouped, on the basis of similarities in their sequences,
into a limited number of families.
Proteins or protein domains belonging to a particular
family generally share functional attributes and are
derived from a common ancestor.
PROSITE currently contains patterns and profiles
specific for more than a thousand protein families or
domains.
Each of these signatures comes with documentation
providing background information on the structure and
function of these proteins.
The ProRule section of PROSITE is constituted of
manually created rules that can automatically generate
annotation in the UniProtKB/Swiss-Prot format based on
PROSITE motifs.

PROSITE's uses include identifying possible functions of
newly discovered proteins and analysis of known
proteins for previously undetermined activity.
PROSITE offers tools for protein sequence analysis and
motif detection (see sequence motif, PROSITE patterns).
It is part of the ExPASy proteomics analysis servers.

PRINTS
The PRINTS database is a collection of so-called
"fingerprints": it provides both a detailed annotation
resource for protein families, and a diagnostic tool for
newly determined sequences.
A fingerprint is a group of conserved motifs taken from
a multiple sequence alignment - together, the motifs form
a characteristic signature for the aligned protein family.
The motifs themselves are not necessarily contiguous in
sequence, but may come together in 3D space to define
molecular binding sites or interaction surfaces.
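The idea that several separate motifs together form one signature
can be sketched as follows. This is a deliberately simplified
stand-in and not how PRINTS scores sequences: each motif is written
here as a plain regular expression, the function name
match_fingerprint is invented, and the motifs and query sequence are
made up. A sequence is accepted only if every motif is found, in
order and without overlap.

```python
import re
from typing import List, Optional


def match_fingerprint(sequence: str, motifs: List[str]) -> Optional[List[int]]:
    """Return the start position of each motif if all occur in order, else None."""
    positions = []
    search_from = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, search_from)
        if hit is None:                # one missing motif breaks the fingerprint
            return None
        positions.append(hit.start())
        search_from = hit.end()        # later motifs must lie downstream
    return positions


# Hypothetical two-motif fingerprint and query sequence.
fingerprint = ["GKS[TS]", "DEAD"]
print(match_fingerprint("AAGKSTLLLDEADGG", fingerprint))
```

The gap between the two reported positions is not constrained, which
reflects the point that the motifs need not be contiguous in the
sequence.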

The particular diagnostic strength of fingerprints lies in
their ability to distinguish sequence differences at the
clan, superfamily, family and subfamily levels.
This allows fine-grained functional diagnoses of
uncharacterised sequences, allowing, for example,
discrimination between family members on the basis of
the ligands they bind or the proteins with which they
interact, and highlighting potential oligomerisation or
allosteric sites.
PRINTS is a founding partner of the integrated
resource, InterPro, a widely used database of protein
families, domains and functional sites.
FPScan - search PRINTS with a query sequence/ID

Pfam
Pfam is a database of protein families that includes their
annotations and multiple sequence alignments generated
using hidden Markov models.
For each family in Pfam one can:
Look at multiple alignments
View protein domain architectures
Examine species distribution
Follow links to other databases
View known protein structures
Nearly 80% of protein sequences in the UniProt
Knowledgebase have at least one match to Pfam. [4] This
number is called the sequence coverage.
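Sequence coverage as defined here is simply the fraction of
sequences with at least one family match. A minimal sketch of the
calculation, assuming the matches are already available as a mapping
from sequence identifier to a list of matched families (the function
name, identifiers and family labels below are all hypothetical):

```python
from typing import Dict, List


def sequence_coverage(matches_per_sequence: Dict[str, List[str]]) -> float:
    """Fraction of sequences that have at least one family match."""
    if not matches_per_sequence:
        return 0.0
    covered = sum(1 for families in matches_per_sequence.values() if families)
    return covered / len(matches_per_sequence)


# Made-up example data: three of the five sequences have a match.
example_matches = {
    "SEQ_1": ["FAM_A"],
    "SEQ_2": [],
    "SEQ_3": ["FAM_A", "FAM_B"],
    "SEQ_4": ["FAM_B"],
    "SEQ_5": [],
}
print(sequence_coverage(example_matches))
```

A sequence with several matches counts once, so coverage measures how
many sequences are reached rather than how many domains are found.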
The Pfam database contains information about protein
domains and families.
Pfam-A is the manually curated portion of the database
that contains over 10,000 entries.
For each entry a protein sequence alignment and
a hidden Markov model are stored.
These hidden Markov models can be used to search
sequence databases.

BLOCKS
The Blocks Database contains multiple alignments of
conserved regions in protein families.
The database can be searched by e-mail and World Wide
Web (WWW) servers (http://blocks.fhcrc.org/help) to
classify protein and nucleotide sequences.
The description of a protein family by its conserved
regions focuses on the family's characteristic and
distinctive sequence features, thus reducing noise.
Databases of conserved features of protein families can
be utilized to classify sequences from proteins, cDNAs
and genomic DNA (25). An example is the Blocks
Database (3), which consists of ungapped multiple
alignments of short regions, called blocks (6).
The database was constructed from sequences of protein
families using a fully automated method. Searching the
Blocks database with a sequence query allows detection
of one or more blocks representing a family.
Each block entry is divided into header and sequences
parts. The header part consists of four lines. The ID
(identification) line contains the block's family short
description and identifies the entry as a block type.
The automated construction and extensive data in the
Blocks database make it suitable for uses other than
protein classification.

BioGRID
The Biological General Repository for Interaction
Datasets (BioGRID) is a curated biological
database of protein-protein and genetic interactions
created in 2003.
It strives to provide a comprehensive resource
of protein-protein and genetic interactions for all
major model organism species while attempting to
remove redundancy to create a single mapping of
interactions.
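Removing redundancy from interaction records can be pictured as
treating each interaction as an unordered pair, so that A with B and
B with A count as the same entry. The sketch below shows this idea
only; it is not BioGRID's actual procedure, the function name is
invented, and the records are made-up examples.

```python
from typing import List, Tuple

Interaction = Tuple[str, str, str]   # (partner A, partner B, interaction type)


def unique_interactions(records: List[Interaction]) -> List[Interaction]:
    """Keep the first occurrence of each unordered pair per interaction type."""
    seen = set()
    unique = []
    for partner_a, partner_b, kind in records:
        # frozenset ignores the order in which the two partners are listed.
        key = (frozenset((partner_a, partner_b)), kind)
        if key not in seen:
            seen.add(key)
            unique.append((partner_a, partner_b, kind))
    return unique


# Hypothetical records; the second is the first with partners swapped.
records = [
    ("RAF1", "HRAS", "physical"),
    ("HRAS", "RAF1", "physical"),
    ("RAF1", "HRAS", "genetic"),
    ("RAF1", "RAF1", "physical"),
]
print(unique_interactions(records))
```

Physical and genetic interactions between the same pair are kept as
separate entries, since they report different kinds of evidence.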
The Biological General Repository for Interaction
Datasets (BioGRID) database was developed to house
and distribute collections of protein and genetic
interactions from major model organism species.
Users of the BioGRID can search for their protein of
interest and retrieve annotation, as well as physical
and genetic interaction data as reported by the primary
literature and compiled by in-house large-scale curation
efforts.

Originally separated into organism specific databases, the
newest version now provides a unified front end allowing
for searches across several organisms simultaneously.
The BioGRID is funded by the BBSRC, NIH, and CIHR.

InterPro
InterPro provides functional analysis of proteins by
classifying them into families and predicting domains
and important sites.
InterPro is a database of protein families, domains and
functional sites in which identifiable features found in
known proteins can be applied to new protein
sequences[1] in order to functionally characterise them.
The contents of InterPro are based around diagnostic
signatures and the proteins that they significantly match.
The signatures consist of models (simple types, such
as regular expressions, or more complex ones, such
as hidden Markov models) which describe protein
families, domains or sites.
Models are built from the amino acid sequences of
known families or domains and they are subsequently
used to search unknown sequences (such as those arising
from novel genome sequencing) in order to classify
them.

Each of the member databases of InterPro contributes
towards a different niche, from very high-level,
structure-based classifications (SUPERFAMILY and
CATH-Gene3D) through to quite specific sub-family
classifications (PRINTS and PANTHER).


InterPro contains three main entities: proteins, signatures
(also referred to as "methods" or "models") and entries.
Like other EBI databases, it is in the public domain,
since its content can be used "by any individual and for
any purpose".
IPRxxxxxx is an InterPro accession number, for
instance IPR000001.
InterPro aims to release data to the public every 8 weeks,
typically within a day of the UniProtKB release of the
same proteins.
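The accession format described above lends itself to a simple
pattern check. The sketch below assumes only what the text states,
namely the prefix IPR followed by six digits; the function name and
the example sentence are invented for illustration.

```python
import re

# Assumed format, taken from the text: "IPR" followed by exactly six digits.
INTERPRO_ACCESSION = re.compile(r"\bIPR\d{6}\b")


def find_interpro_accessions(text: str) -> list:
    """Return all InterPro-style accession numbers found in free text."""
    return INTERPRO_ACCESSION.findall(text)


# The second candidate has only five digits and is therefore ignored.
print(find_interpro_accessions("Matches: IPR000001 and IPR12345 (too short)."))
```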

Structure databases
PDB
The Protein Data Bank (PDB) is a repository for the
three-dimensional structural data of large biological
molecules, such as proteins and nucleic acids. (See
also crystallographic database.)

The data, typically obtained by X-ray
crystallography or NMR spectroscopy and submitted
by biologists and biochemists from around the world, are
freely accessible on the Internet via the websites of its
member organisations.
The PDB is overseen by an organization called
the Worldwide Protein Data Bank, wwPDB.
The PDB is a key resource in areas of structural biology,
such as structural genomics.
Most major scientific journals, and some funding
agencies, now require scientists to submit their structure
data to the PDB.
If the contents of the PDB are thought of as primary data,
then there are hundreds of derived (i.e., secondary)
databases that categorize the data differently. For
example, both SCOP and CATH categorize structures
according to type of structure and assumed evolutionary
relations; GO categorizes structures based on genes.
Each structure published in PDB receives a four-character
alphanumeric identifier, its PDB ID.
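A four-character alphanumeric identifier is easy to check before it
is used in a lookup. The sketch below implements only the rule given
in the text (exactly four letters or digits); any stricter convention
is outside its scope. The function name is invented, the choice to
normalise to upper case is an assumption of this example, and the
identifiers shown are placeholders rather than references to
particular structures.

```python
import re


def normalize_pdb_id(raw: str) -> str:
    """Return an upper-case four-character ID, or raise ValueError."""
    candidate = raw.strip().upper()
    if not re.fullmatch(r"[0-9A-Z]{4}", candidate):
        raise ValueError(f"not a four-character alphanumeric ID: {raw!r}")
    return candidate


# Placeholder identifier with surrounding whitespace and mixed case.
print(normalize_pdb_id(" 1abc "))
```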

The structure files may be viewed using one of several
free or open source computer programs, including
ICM-Browser,[18] VMD, MDL Chime, PyMOL, UCSF Chimera,
RasMol and Swiss-PDB Viewer.

SCOP
The Structural Classification of Proteins (SCOP)
database is a largely manual classification of
protein structural domains based on similarities of
their structures and amino acid sequences.


The Structural Classification of Proteins (SCOP)
database provides a detailed and comprehensive
description of the relationships of known protein
structures.
The classification is on hierarchical levels: the first two
levels, family and superfamily, describe near and distant
evolutionary relationships; the third, fold, describes
geometrical relationships.

A motivation for this classification is to determine the
evolutionary relationship between proteins.
Proteins with the same shapes but having little sequence
or functional similarity are placed in different
"superfamilies", and are assumed to have only a very
distant common ancestor.
Proteins having the same shape and some similarity of
sequence and/or function are placed in "families", and
are assumed to have a closer common ancestor.
The SCOP database is freely accessible on the internet.
SCOP was created in 1994.[1]
The source of protein structures is the Protein Data Bank.
The unit of classification of structure in SCOP is
the protein domain.
The shapes of domains are called "folds" in SCOP.
Domains belonging to the same fold have the same major
secondary structures in the same arrangement with the
same topological connections.

The levels of SCOP are as follows.


Class: Types of folds, e.g., beta sheets.
Fold: The different shapes of domains within a class.
Superfamily: The domains in a fold are grouped into
superfamilies, which have at least a distant common
ancestor.
Family: The domains in a superfamily are grouped into
families, which have a more recent common ancestor.
Protein domain: The domains in families are grouped
into protein domains, which are essentially the same
protein.
Species: The domains in "protein domains" are grouped
according to species.
Domain: part of a protein. For simple proteins, it can be
the entire protein.
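The nested levels listed above can be modelled as a path from class
down to species, and two domains can then be compared by finding the
deepest level at which their paths still agree. The sketch below is
only an illustration of that idea: the function name is invented and
the labels in the example are placeholders, not real SCOP entries.

```python
from typing import Optional, Sequence

# Levels in the order given in the text, from broadest to most specific.
SCOP_LEVELS = ("class", "fold", "superfamily", "family", "protein domain", "species")


def deepest_shared_level(path_a: Sequence[str], path_b: Sequence[str]) -> Optional[str]:
    """Name the most specific level at which two classification paths agree."""
    shared = None
    for level, label_a, label_b in zip(SCOP_LEVELS, path_a, path_b):
        if label_a != label_b:       # paths diverge here; stop descending
            break
        shared = level
    return shared


# Placeholder classification paths for two hypothetical domains.
domain_1 = ("class 1", "fold X", "superfamily Y", "family 1", "protein P", "human")
domain_2 = ("class 1", "fold X", "superfamily Y", "family 2", "protein Q", "mouse")
print(deepest_shared_level(domain_1, domain_2))
```

Under the interpretation given in the text, two domains that share a
superfamily but not a family would be assumed to have a distant
rather than a recent common ancestor.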

CATH
The CATH Protein Structure Classification is a semi-automatic,
hierarchical classification of protein domains.
CATH shares many broad features with its principal
rival, SCOP; however, there are also many areas in which
the detailed classification differs greatly.

Only crystal structures solved to resolution better than
4.0 angstroms are considered, together with NMR
structures.
Protein structures are classified using a combination of
automated and manual procedures. There are four major
levels in this hierarchy: Class, Architecture, Topology
(fold family) and Homologous superfamily
(Orengo et al., 1997).


1 Class: the overall secondary-structure content of
the domain.
2 Architecture: a large-scale grouping of topologies
which share particular structural features.
3 Topology: high structural similarity but no evidence
of homology. Equivalent to a fold in SCOP.
4 Homologous superfamily: indicative of a demonstrable
evolutionary relationship. Equivalent to the
superfamily level of SCOP.
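The four levels can be carried around as a single four-part code.
The sketch below assumes a dotted numeric notation with one number
per level (for example 3.40.50.300) and assumes that classes 1, 2
and 3 correspond to the three major classes recognised by CATH;
neither the notation nor the numbering is stated in this text, so
both should be treated as assumptions of the example. The names
CathCode and parse_cath_code are invented.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One number per CATH level, from broadest to most specific."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


# Assumed numbering of the three major classes.
CLASS_NAMES = {1: "mainly-alpha", 2: "mainly-beta", 3: "alpha-beta"}


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted four-part code into its levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


parsed = parse_cath_code("3.40.50.300")
print(parsed, CLASS_NAMES.get(parsed.class_, "other"))
```

Two domains with codes that agree in the first three numbers would
share a topology without necessarily belonging to the same
homologous superfamily.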

Class is determined according to the secondary
structure composition and packing within the
structure. Three major classes are recognised:
mainly-alpha, mainly-beta and alpha-beta.

EuroCarbDB
EuroCarbDB is an EU-funded initiative for the creation
of software and standards for the systematic collection
of carbohydrate structures and their experimental data.
The EuroCarbDB project is a design study for a
technical framework, which provides sophisticated,
freely accessible, open-source informatics tools and
databases to support glycobiology and glycomic
research.
EuroCarbDB is a relational database containing glycan
structures, their biological context and, when available,
primary and interpreted analytical data from
high-performance liquid chromatography, mass spectrometry
and nuclear magnetic resonance experiments.
Database content can be accessed via a web-based user
interface.
The database is complemented by a suite of
glycoinformatics tools, specifically designed to assist the
elucidation and submission of glycan structure and
experimental data when used in conjunction with
contemporary carbohydrate research workflows.
The project includes a database of known carbohydrate
structures and experimental data, specifically mass
spectrometry, HPLC and NMR data, accessed via a web
interface that provides for browsing, searching and
contribution of structures and data to the database.
The project has also produced a number of
associated bioinformatics tools for carbohydrate
researchers.
The online version of EuroCarbDB is hosted by
the European Bioinformatics Institute.
EuroCarbDB is also an actively developed open
source project.
A specific design objective of the architecture of the
database was to allow for the extension and incorporation
of new modules and tools to support further types of
experimental data and workflows.

PubChem Compound
PubChem is a database of chemical molecules and their
activities against biological assays. The system is
maintained by the National Center for Biotechnology
Information (NCBI), a component of the National Library
of Medicine, which is part of the United States National
Institutes of Health (NIH).
PubChem can be accessed for free through a web user
interface.
Millions of compound structures and descriptive datasets
can be freely downloaded via FTP.
PubChem contains substance descriptions and small
molecules with fewer than 1000 atoms and 1000 bonds.
More than 80 database vendors contribute to the growing
PubChem database.
PubChem consists of three dynamically growing primary
databases (as of 7 January 2011): Compounds, Substances
and BioAssay.
PubChem Compound (4) is a searchable database of
chemical structures with validated chemical depiction
information provided to describe substances in PubChem
Substance.
Structures stored within PubChem Compound are
pre-clustered and cross-referenced by identity and similarity
groups.
PubChem Compound includes over 5M compounds.

Molecular Name Searches (e.g., Tylenol, Benzene) allow


searching with a variety of chemical synonyms,
Chemical Property Range Searches (e.g., Molecular
Weight between 100 and 200, Hydrogen Bond Acceptor
Count between 3 and 5) allow searching for compounds
with a variety of physical/chemical properties, and
descriptors.
Simple Elemental Searches (all compounds containing
Gallium)

allow

restrictions.

DrugBank,

The DrugBank database[1] is a comprehensive,
high-quality, freely accessible, online database containing
information on drugs and drug targets.
As both a bioinformatics and a cheminformatics resource,
DrugBank combines detailed drug (i.e. chemical,
pharmacological and pharmaceutical) data with
comprehensive drug target (i.e. sequence, structure, and
pathway) information.[1][2]
Because of its broad scope, comprehensive referencing
and unusually detailed data descriptions, DrugBank is
more akin to a drug encyclopedia than a drug database.
As a result, links to DrugBank are maintained for nearly
all drugs listed in Wikipedia.
DrugBank is widely used by the drug industry, medicinal
chemists, pharmacists, physicians, students and the
general public.
Its extensive drug and drug-target data has enabled the
discovery and repurposing of a number of existing drugs
to treat rare and newly identified illnesses.

The latest release of the database (version 4.0) contains
7677 drug entries including 1558 FDA-approved small
molecule drugs, 155 FDA-approved biotech
(protein/peptide) drugs, 87 nutraceuticals and over
6000 experimental drugs.
Each DrugCard entry contains more than 200 data fields
with half of the information being devoted to
drug/chemical data and the other half devoted to drug
target or protein data.
All data in DrugBank is non-proprietary or is derived
from a non-proprietary source. It is freely accessible and
available to anyone.
In addition, nearly every data item is fully traceable and
explicitly referenced to the original source. DrugBank
data is available through a public web interface and
downloads.

ChemSpider
ChemSpider is a chemical database owned by the Royal
Society of Chemistry.

The database contains more than 30 million unique
molecules from over 450 data sources including:
U.S. Food and Drug Administration (FDA), National
Institutes of Health (NIH), QSAR, xPharm, ZINC.
Each chemical is given a unique identifier, which forms
part of a corresponding URL.
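
The link between identifier and URL can be sketched as
follows. The URL pattern used here is the one ChemSpider has
historically used for record pages (for example, identifier
175 for acetone); it is an assumption that the site still
serves this layout, and the function names chemspider_url
and csid_from_url are hypothetical helpers, not part of any
ChemSpider API.

```python
import re

# Assumed record-page pattern; the live site may redirect or differ.
RECORD_URL = "http://www.chemspider.com/Chemical-Structure.{csid}.html"
_RECORD_RE = re.compile(r"Chemical-Structure\.(\d+)\.html$")


def chemspider_url(csid: int) -> str:
    """Build the record URL for a ChemSpider identifier."""
    if csid <= 0:
        raise ValueError("identifier must be a positive integer")
    return RECORD_URL.format(csid=csid)


def csid_from_url(url: str) -> int:
    """Recover the identifier embedded in a record URL."""
    match = _RECORD_RE.search(url)
    if match is None:
        raise ValueError(f"not a recognised record URL: {url!r}")
    return int(match.group(1))


if __name__ == "__main__":
    url = chemspider_url(175)
    print(url)
    print(csid_from_url(url))
```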
The ChemSpider database can be updated with user
contributions including chemical structure deposition,
spectra deposition and user curation.
This is a crowdsourcing approach to developing an online
chemistry database. Crowdsourcing-based curation of the
data has produced a dictionary of chemical names
associated with chemical structures that has been used in
text-mining applications of the biomedical and chemical
literature.
A number of available search modules are provided:
The standard search allows querying for systematic
names, trade names, synonyms and registry numbers.
The advanced search allows interactive searching by
chemical structure and chemical substructure, and also by
molecular formula, molecular weight range, CAS
numbers, suppliers, etc. The search can be used to widen
or restrict already found results.
Structure searching on mobile devices can be done using
free apps for iOS (iPhone/iPod/iPad)[16] and for
Android.[17]

and Cambridge Structural Database.


The Cambridge Structural Database (CSD) is a
repository for small molecule crystal structures.
Scientists use single-crystal X-ray crystallography to
determine the crystal structure of a compound.
Once the structure is solved, information about the
structure is saved in a file (CIF format) and deposited in
the CSD.
Other scientists can search and retrieve structures from
the database.
The information consists of the space group symmetry of
the crystalline phase, its cell parameters, and the relative
atomic coordinates of all the atoms in the cell in 3D.
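
To make the contents of such a file concrete, the sketch
below reads the space group and cell parameters from a small
CIF fragment and computes the unit cell volume. The data
names (_cell_length_a, _cell_angle_alpha,
_symmetry_space_group_name_H-M) are standard CIF items, but
the sample values are made up for illustration, and the
parser is a deliberately minimal assumption: it handles only
simple "name value" lines, not loops (where the atomic
coordinates live) or multi-line values.

```python
import math
import re

# Made-up example values; not a real CSD entry.
SAMPLE_CIF = """\
data_example
_symmetry_space_group_name_H-M   'P 21 21 21'
_cell_length_a    5.000(2)
_cell_length_b    6.000(2)
_cell_length_c    7.000(3)
_cell_angle_alpha 90
_cell_angle_beta  90
_cell_angle_gamma 90
"""


def read_cif_items(text):
    """Collect simple 'name value' pairs; loops are ignored."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip().strip("'\"")
    return items


def cif_number(value):
    """Convert '5.000(2)' to 5.0 by dropping the uncertainty in brackets."""
    return float(re.sub(r"\(\d+\)$", "", value))


def cell_volume(items):
    """General (triclinic) unit cell volume in cubic angstroms."""
    a, b, c = (cif_number(items[f"_cell_length_{x}"]) for x in "abc")
    ca, cb, cg = (
        math.cos(math.radians(cif_number(items[f"_cell_angle_{x}"])))
        for x in ("alpha", "beta", "gamma")
    )
    return a * b * c * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg
    )


if __name__ == "__main__":
    items = read_cif_items(SAMPLE_CIF)
    print(items["_symmetry_space_group_name_H-M"])
    print(round(cell_volume(items), 3))
```

For real files a dedicated CIF reader is the safer choice,
since the format allows constructs this sketch skips.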

Scientists can use the CSD to compare existing data with
that obtained from crystals grown in their laboratories.
The information can also be used to visualize the
structure in a variety of software packages such as
ATOMS, PowderCell, etc.


It is also possible to calculate what the theoretical
powder diffraction pattern of the phase would look like.
This option is particularly important for
analytical reasons because it facilitates the identification
of phases present in a crystalline powder mixture without
the need for growing crystals.
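
As a rough illustration of how peak positions follow from
the cell parameters, the sketch below applies Bragg's law
(lambda = 2 d sin(theta)) to a cubic cell. It rests on
several assumptions not taken from the text: a cubic lattice,
Cu K-alpha radiation (1.5406 angstroms), a hypothetical cell
edge of 4.0 angstroms, and no treatment of intensities or
systematic absences, so it gives candidate peak positions
only, not a full pattern.

```python
import math

CU_K_ALPHA = 1.5406  # wavelength in angstroms (assumed radiation)


def d_spacing_cubic(a, hkl):
    """Interplanar spacing for a cubic cell with edge a."""
    h, k, l = hkl
    return a / math.sqrt(h * h + k * k + l * l)


def two_theta(a, hkl, wavelength=CU_K_ALPHA):
    """Diffraction angle 2-theta in degrees, or None if unreachable."""
    s = wavelength / (2 * d_spacing_cubic(a, hkl))
    if s > 1:
        return None  # Bragg condition cannot be met at this wavelength
    return 2 * math.degrees(math.asin(s))


def peak_positions(a, max_index=2, wavelength=CU_K_ALPHA):
    """List ((h, k, l), 2-theta) for h >= k >= l >= 0, sorted by angle."""
    peaks = {}
    for h in range(max_index + 1):
        for k in range(h + 1):
            for l in range(k + 1):
                if (h, k, l) == (0, 0, 0):
                    continue
                angle = two_theta(a, (h, k, l), wavelength)
                if angle is not None:
                    peaks[(h, k, l)] = angle
    return sorted(peaks.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical cubic cell edge of 4.0 angstroms.
    for hkl, angle in peak_positions(4.0):
        print(hkl, round(angle, 2))
```

Comparing such calculated positions with a measured pattern
is the basic idea behind phase identification in a powder
mixture; dedicated software also computes intensities from
the atomic coordinates.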
Many of the small molecules are organic compounds of
the sort that could potentially act as medical drugs, and a
very important use of the CSD is for structural
comparisons among related molecules that can suggest
new leads for drug design.
The CSD is compiled and maintained by the Cambridge
Crystallographic Data Centre.
Each crystal structure undergoes extensive validation and
cross-checking by expert chemists and crystallographers
to ensure that the CSD is maintained to the highest
possible standards.
Also, each database entry is enriched with bibliographic,
chemical and physical property information, adding
further value to the raw structural data.
These editorial processes are vital for enabling scientists
to interpret structures in a chemically meaningful way.
The CSD is continually updated with new structures
(>40,000 new structures each year) and with
improvements to existing entries.
With regular web-updates and early online access to
newly published structures you can keep fully informed
of the latest research.
