INFORMATION RESOURCES
OBJECTIVE
After going through this unit, you should be able to understand the various
biological databases.
STRUCTURE:
9.1 INTRODUCTION
Biological databases are computer sites that organise, store and disseminate files
that contain information consisting of literature references, nucleic acid sequences, protein
sequences and protein structures.
TYPES
1) Depending on the nature of the information being stored: sequences and structures,
2D gel or 3D structure images.
These are used to address different aspects of sequence analysis, because they
store different levels of protein sequence information.
In the early 1980s, sequence information started to become more abundant in the
scientific literature. Realising this, several laboratories saw that there might be an advantage
to harvesting and storing these sequences in central repositories. Thus several primary
database projects began to evolve in different parts of the world.
Thus a primary database is a database that stores biomolecular sequences
(protein or nucleic acid) and associated annotation information (organism, species,
function, mutations linked to particular diseases, functional/structural patterns,
bibliographic references, etc.).
1. GenBank
GenBank is the DNA database from the National Center for Biotechnology
Information (NCBI). NCBI is a division of the National Library of Medicine, located at
the National Institutes of Health (NIH) in Bethesda, Maryland.
As of Release 127.0 on 15th December 2001, there were approximately 15,850
million bases in around 15 million sequence records.
2. EMBL
3. DDBJ
Each of the public protein sequence databases has its own distinctive features. The major
protein sequence databases are:
PIR was developed at the National Biomedical Research Foundation in the early
1960s by Margaret Dayhoff as a collection of sequences for investigating evolutionary
relationships among proteins.
In its current form, the database is split into four sections, PIR1 to PIR4.
MIPS collects and processes sequence data for the tripartite PIR-International
Protein Sequence Database Project.
3. Swiss-Prot
(ii) REM-TrEMBL
- Immunoglobulins
- T-cell receptors
- Fragments of fewer than eight amino acids
- Synthetic sequences
- Patented sequences
- Codon translations (that do not encode real proteins)
5. NRL-3D
The NRL-3D database was produced by PIR from sequences extracted from the
Brookhaven Protein Data Bank (PDB).
The main composite databases are:
- Multiple copies of the same protein are retained in the database as a result
of polymorphisms and / or minor sequencing errors.
- Incorrect sequences that have been amended in SWISS-PROT are
reintroduced when translated from the DNA.
- Numerous sequences are incorporated as full entries of existing fragments.
- In view of this, the contents of NRDB are both error-prone and, in spite of
its name, redundant.
- NRDB is the default database of the NCBI BLAST Service.
2) OWL
The process eliminates both identical copies of sequences and those containing
single amino acid differences. OWL is released on a 6-8 weekly basis.
Secondary databases are those that contain information derived from
primary sequence data, typically in the form of regular expressions (patterns),
fingerprints, blocks, profiles or Hidden Markov Models.
- Homologous sequences may be gathered together in multiple alignments.
- Conserved regions (or) motifs are found.
- Conserved regions usually reflect some vital biological role (i.e. they are somehow crucial to the
structure or function of the protein).
Some of the major secondary databases are:
PROSITE
PRINTS
BLOCKS
IDENTIFY
PROFILES
Pfam
PROSITE:
e.g. enzyme active sites, ligand or metal binding sites. Within PROSITE, motifs are encoded
as regular expressions, often referred to as patterns. Sometimes a complete protein family
cannot be characterized effectively by a single motif. In these cases additional patterns
are designed to encode other well-conserved parts of the alignment.
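The idea of a motif encoded as a regular expression can be illustrated with a small sketch. It assumes the usual PROSITE pattern conventions (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats, "<" and ">" for sequence ends). The function name prosite_to_regex and the test sequences are hypothetical and are used only for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")            # patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:                          # x(3) or x(2,4)
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":
            core = "."                       # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                   # single residue or [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# An illustrative zinc-finger-like pattern and a made-up sequence.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(regex)
print(bool(re.search(regex, "CAACAAALAAAAAAAAHAAAH")))
```

A single regular expression either matches or does not, which is why families that cannot be captured by one motif need several patterns.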
1. PRINTS
2. Blocks
4. Profiles
A position-specific scoring table that encapsulates the sequence information within
complete alignments is termed a profile. Profiles define which residues are allowed at
given positions, which positions are conserved and which are degenerate, and which positions
or regions can tolerate insertions.
The principle is that the variable regions between conserved motifs also contain
valuable sequence information, so the complete sequence alignment effectively becomes the
discriminator. The profile is weighted to indicate where insertions and deletions (INDELs)
are allowed. Hence profiles are also called weight matrices.
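A minimal sketch of a position-specific scoring table is given below. It assumes a uniform background frequency and a simple pseudocount, both chosen only for illustration; the position-specific gap weighting described above is left out. The names build_profile and score and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)      # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the column scores for an ungapped segment as long as the profile."""
    return sum(column[res] for column, res in zip(profile, segment))

toy_alignment = ["GKSTL", "GKTTL", "GRSSL"]   # made-up aligned sequences
profile = build_profile(toy_alignment)
print(score(profile, "GKSTL") > score(profile, "WWWWW"))
```

Residues seen often in a column receive positive scores and unseen residues negative ones, so a family member scores higher than an unrelated segment.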
5. Pfam
i) Hand-edited seed alignments. These are accurate and are used to produce Pfam-A.
ii) Those derived by automatic clustering of SWISS-PROT. These are less reliable
and are used to produce Pfam-B.
All sequences that are not included in Pfam-A are automatically clustered and
deposited in Pfam-B.
Check Your Progress 4:
Mention a few secondary databases.
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
SCOP, maintained at the MRC Laboratory of Molecular Biology and Centre for
Protein Engineering, describes structural and evolutionary relationships between proteins
of known structure.
The SCOP classification consists of (1) family, (2) superfamily and (3) fold of the
protein sequence. SCOP is accessible for keyword interrogation via the MRC Laboratory
web server.
2. CATH (Class, Architecture, Topology, Homology database)
There are many more databases, some of which provide very specialized
information. They are:
1. GeneCards
GeneCards is a database of human genes, their products and their involvement in
diseases. It offers concise information about each gene that has an approved symbol.
4. UNIGENE
This database contains DNA and protein sequence, gene expression, cellular role,
protein family and taxonomic data for microbes, plants and humans.
6. ACeDB (A CAENORHABDITIS ELEGANS DATA BASE)
ACeDB arose from the C. elegans genome project. It includes restriction maps, gene
structural information, cosmid maps, sequence data and bibliographic references. It enables
the user to view genomic data at different stages of resolution, from the level of a
complete chromosome down to the physical level.
What is KEGG?
9.10 SUMMARY:
* Databases are used to store the vast amounts of information issuing from the
genome projects.
* Primary databases contain sequence data (nucleic acid or protein).
* Composite databases amalgamate a variety of different primary sources and are
hence efficient to search.
* Secondary databases contain pattern data, i.e. diagnostic structures for protein
families.
The computational process of interpreting the sequence of nucleotides in mRNA
via the genetic code as a sequence of amino acids, which may or may not code for a
protein.
Contig:
Sequences of clones, representing overlapping regions of a gene, presented as an
assembly or multiple alignment.
Databases:
Collections of data in machine-readable form, which can be manipulated by
software to appear in varying arrangements and subsets.
Nucleotide:
A molecule consisting of a nitrogenous base (A, G, T or C in DNA; A, G, U or C in
RNA), a phosphate moiety and a sugar group (deoxyribose in DNA and ribose in RNA).
Thousands of nucleotides are linked to form a DNA or RNA molecule.
Primary Database:
A database that stores biomolecular sequences (protein or nucleic acid) and
associated annotation information (organism, species, function, mutations linked to
particular diseases, functional/structural patterns, bibliographic references, etc.).
Protein:
A molecule consisting of one or more chains of amino acids in a specific order. The
order is determined by the base sequence of nucleotides in the gene coding for the
protein.
PIR:
A database of translated GenBank nucleotide sequences. PIR is a redundant protein
sequence database. The database is divided into 4 categories:
PIR-1: Classified and annotated
PIR-2: Annotated
PIR-3: Unverified
PIR-4: Unencoded / untranslated
Relational database:
A database that uses a relational data model, in which data are stored in 2-
dimensional tables. The tables embody different aspects or properties of data, but contain
overlapping information.
Secondary database:
A database that contains information derived from primary sequence data,
typically in the form of regular expressions (patterns), fingerprints, blocks, profiles or
Hidden Markov Models. These abstractions represent distillations of the most conserved
features of multiple alignments, such that they are able to provide potent discriminators
of family membership for newly determined sequences.
Single Nucleotide Polymorphisms (SNPs):
SNPs are defined as single base-pair positions in genomic DNA that vary among
individuals in one or several populations.
Swiss-Prot:
A non-redundant protein sequence database, thoroughly annotated and
cross-referenced. A subdivision is TrEMBL.
9.12 ANSWER TO CHECK-UP YOUR PROGRESS:
(1) Biological Databases are computer sites that organize, store and disseminate files that
contain information consisting of literature reference, nucleic acid sequences, protein
sequences and protein structure.
(2) They are-Primary Databases
-Composite Databases
- Secondary Databases
(3) Composite database is a database that amalgamates a variety of different primary
sources. It renders sequence searching much more efficient, because they obviate the
need to interrogate multiple resources.
(4) Secondary databases are those that contain information derived from primary
sequence data, typically in the form of regular expressions (patterns), fingerprints,
blocks, profiles or Hidden Markov Models.
(5) It is an effort to computerize current knowledge of molecular and cellular biology in
terms of information pathways that consist of interacting molecules or genes, and to
provide links from the gene catalogues produced by genome sequencing projects.
9.13 SELF ASSESSMENT QUESTIONS:
1. What are Primary Databases? Explain in detail.
UNIT-X
SEQUENCE ANALYSIS
OBJECTIVE
STRUCTURE:
10.1 Introduction
10.2 Sequence similarity searches Pairwise Alignment Techniques.
10.3 Scoring matrices
10.3.1 Dayhoff Mutation Data Matrix.
10.3.2 The BLOSUM Matrices.
10.4 Dynamic Programming
10.5 Comparative Analysis by Pairwise Alignment
10.6 Tools
10.6.1 Fasta
10.6.2 Blast
10.7 Multiple Sequence Alignment
10.8 Multiple Alignment Tools.
10.9 Summary
10.10 Key words
10.11 Answer to checkup your Progress
10.12 Self Assessment Questions
10.13 Further Readings.
10.1 INTRODUCTION
Line up the sequences against each other and insert additional characters to bring
the two strings into vertical alignment.
Unaligned
We could score the alignment by counting how many positions match identically.
Here the unaligned score is 6 and the aligned score is 9.
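The identity count can be sketched in a few lines. The two sequences below are made up for illustration (the sequences behind the scores 6 and 9 quoted above are not shown in this unit), and the function name identity_score is hypothetical.

```python
def identity_score(seq_a: str, seq_b: str, gap: str = "-") -> int:
    """Count positions at which both strings carry the same residue."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)

# Unaligned: the strings are simply written one under the other.
print(identity_score("GATTACA", "GTTACA"))    # 2
# Aligned: one gap character brings the remaining residues into register.
print(identity_score("GATTACA", "G-TTACA"))   # 6
```

Inserting a single gap raises the count from 2 to 6 in this toy case, which is the effect the unaligned and aligned scores in the text describe.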
The alignment score is then a function of the identity between aligned residues and the gap
penalties incurred.
10.3 SCORING MATRICES
Scoring matrices have been devised that weight matches between non-identical
residues. They include the Dayhoff mutation data matrix and the BLOSUM matrices.
Sequences clustered at greater than or equal to 80% identity are used to generate
the BLOSUM 80 matrix. Those in the 62% or greater cluster contribute to the BLOSUM
62 matrix.
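How a single matrix entry is obtained from clustered alignments can be sketched as a log-odds calculation. The sketch assumes the half-bit scaling commonly quoted for the BLOSUM series (twice the base-2 logarithm of observed over expected pair frequency, rounded to an integer). The function name and the frequencies are hypothetical and do not come from any real matrix.

```python
import math

def log_odds_score(observed: float, expected: float) -> int:
    """Half-bit log-odds score: 2 * log2(observed / expected), rounded."""
    return round(2 * math.log2(observed / expected))

# Made-up pair frequencies, for illustration only.
print(log_odds_score(0.02, 0.005))    # pair seen 4x more often than chance -> 4
print(log_odds_score(0.001, 0.004))   # pair seen 4x less often than chance -> -4
```

Pairs that occur more often than chance in the aligned blocks get positive scores, and pairs that occur less often get negative scores.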
Statistical values are used to indicate the level of confidence that should be
attached to an alignment. For pairwise alignments, these are usually formulated as
probability (P) values or expected frequency (E) values.
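The relation between the two kinds of value can be sketched as follows, assuming the Karlin-Altschul form often used for local alignment statistics, E = K * m * n * exp(-lambda * S), with P = 1 - exp(-E). The parameters lam and k depend on the scoring system; the numbers used here are placeholders chosen for illustration, not values from any program.

```python
import math

def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of at least one chance alignment, given the E value."""
    return -math.expm1(-e_value)      # equals 1 - exp(-E), stable for small E

e = expect_value(score=60, query_len=250, db_len=5_000_000, lam=0.3, k=0.1)
print(e, p_value(e))
```

For small values E and P are almost equal, which is why the two are often used interchangeably for strong hits.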
Alignments are models that reflect different biological perspectives. One model is
no more right or wrong than another.
a) Across the full extent of the sequences, which is called global alignment. It uses the
Needleman and Wunsch algorithm.
b) Across only part of the sequences, which is called local alignment. It uses the
Smith-Waterman algorithm.
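The difference between the two models can be sketched with dynamic programming. The block below computes scores only (no traceback) and assumes a simple match/mismatch scheme with a linear gap penalty. The parameter values and function names are illustrative choices, not those of any particular program.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below zero
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(global_score("GATTACA", "GTTACA"))               # 4: six matches, one gap
print(local_score("AAAGATTACATTT", "CCCGATTACAGGG"))   # 7: the shared GATTACA core
```

The only differences are the initialisation and the floor at zero: the local version is free to ignore poorly matching flanks, while the global version must pay for them.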
10.4 DYNAMIC PROGRAMMING
10.6 TOOLS
The FASTA and BLAST programs are local similarity search methods that
concentrate on finding short identical matches, which may contribute to a total match.
10.6.1. FASTA
The FASTA algorithm is based on the idea of identifying short words, or k-tuples,
common to both sequences under comparison. K-tuple sizes of 1 or 2 residues are used
in protein searches, while larger k-tuples (up to 6 bases) are used in DNA searches.
FASTA uses a heuristic approach to join k-tuples that lie close together on the
same diagonal. The regions formed in this way contain mismatches lying between
matching k-tuples. If a significant number of matches is found, FASTA uses a dynamic
programming algorithm to compute gapped alignments that incorporate the ungapped
regions.
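The first stage of this idea, looking up shared k-tuples and grouping them by diagonal, can be sketched as below. This is a simplified illustration and not the FASTA program itself: the joining of nearby k-tuples and the later dynamic programming step are omitted. The function name and the sequences are hypothetical.

```python
from collections import defaultdict

def ktuple_diagonals(query, target, k=2):
    """Count shared k-tuples per diagonal (target position - query position)."""
    index = defaultdict(list)                 # k-tuple -> positions in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            diagonals[j - i] += 1             # same offset means same diagonal
    return dict(diagonals)

hits = ktuple_diagonals("ACDEFG", "XXACDEFG", k=2)
print(hits)                                   # {2: 5}
```

A diagonal that collects many k-tuple hits marks a candidate region worth refining with the slower alignment step.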
Check Your Progress 3:
What is BLAST?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Analysis of groups of sequences that form gene families requires the ability to
make connections between more than two family members. Multiple alignments are used
to reveal conserved family characteristics. Multiple alignments are simply models. There
is nothing inherently correct or incorrect about a particular alignment. The important
point is whether the model accurately reflects known biological data. Sequence-based and
structure-based alignments are both imperfect models, since neither can reflect all levels
of biological information. Both approaches are valid representations of particular aspects
of biology, and neither should therefore be considered to represent some ultimate truth or
gold standard.
A multiple alignment can be defined as a 2D table in which the rows represent
individual sequences and the columns the residue positions. Sequences are laid into
this grid in such a manner that
a) the relative positioning of residues within any one sequence is preserved, and
b) similar residues in all the sequences are brought into vertical register.
We call the residue position in an unaligned sequence the absolute position, while the
aligned residue position is termed the relative position.
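The distinction between absolute and relative positions can be made concrete with a short sketch. The aligned row below is made up, positions are counted from 1, and the function name is hypothetical.

```python
def relative_positions(aligned_row, gap="-"):
    """Map absolute (ungapped) positions to aligned column numbers, 1-based."""
    mapping = {}
    absolute = 0
    for column, residue in enumerate(aligned_row, start=1):
        if residue != gap:
            absolute += 1
            mapping[absolute] = column
    return mapping

# The third residue of the unaligned sequence MKTAL sits in column 4 here.
print(relative_positions("MK-TA--L"))   # {1: 1, 2: 2, 3: 4, 4: 5, 5: 8}
```

The order of residues within the row is untouched; only the column each one occupies changes when gaps are inserted.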
The time taken to compute an alignment rises exponentially with the number of
sequences to be aligned. Various methods have been developed that use heuristics to
reduce the time to find good (not necessarily optimal) alignments. Some approaches
combine dynamic programming with heuristics. Such techniques include aligning all
pairs of sequences, aligning each sequence with one specific sequence, aligning
sequences in arbitrary order or aligning sequences following the branching order of a
Phylogenetic tree.
Manual methods are often dismissed as being subjective. However the results of
automatic alignment programs almost invariably require manual polishing, and hence
alignment editors have become essential tools.
Simultaneous multiple alignment methods align all sequences within a set at once,
and hence are very time consuming. They work best on small sets of short sequences.
Clustal uses the positioning of gaps in closely related sequences to guide the insertion
of gaps into those that are more distant. Similarly, information compiled during
the alignment process about the variability of the most similar sequences is used to help
vary gap penalties on a residue-specific and position-specific basis.
There are numerous alignment databases accessible via the web. These result from
different approaches e.g. The application of automated methods to cluster the primary
sequence resources into families or from endeavors to produce gene family
discriminators for inclusion in secondary databases.
BLAST. Although fast to run, it has the disadvantage that the automated iterative search
may degenerate and lead to profile dilution.
What is Clustal W?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
10.9 SUMMARY:
Algorithm:
A logical sequence of steps by which a task can be performed.
Alignment:
substitutions are derived from a scoring matrix such as the PAM and BLOSUM matrices
for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores
are in log-odds units, often bit units. Higher scores denote better alignments.
10.11 ANSWER TO CHECK YOUR PROGRESS:
(3) The important concept in this algorithm is that of the segment pair. Given two
sequences, a segment pair is defined as a pair of subsequences of the same length
that form an ungapped alignment. BLAST calculates all segment pairs between the
query and the database sequences, above a scoring threshold.
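The definition of a segment pair can be illustrated with a brute-force sketch that scans every diagonal for its best-scoring ungapped segment and keeps those reaching a threshold. This is not how BLAST searches in practice; it only mirrors the definition. The scoring values, function name and sequences are assumptions made for the example.

```python
def high_scoring_segment_pairs(query, subject, threshold, match=1, mismatch=-1):
    """Return (score, query_start, subject_start, length) for the best
    ungapped segment on each diagonal whose score reaches the threshold."""
    hits = []
    for offset in range(-(len(query) - 1), len(subject)):
        i = max(0, -offset)                   # diagonal start in the query
        j = max(0, offset)                    # diagonal start in the subject
        best = (0, 0, 0, 0)
        run_score, run_start, step = 0, 0, 0
        while i + step < len(query) and j + step < len(subject):
            if run_score <= 0:                # restart the running segment
                run_score, run_start = 0, step
            same = query[i + step] == subject[j + step]
            run_score += match if same else mismatch
            if run_score > best[0]:
                best = (run_score, i + run_start, j + run_start,
                        step - run_start + 1)
            step += 1
        if best[0] >= threshold:
            hits.append(best)
    return sorted(hits, reverse=True)

print(high_scoring_segment_pairs("GATTACA", "CCGATTACATT", threshold=5))
```

Each reported tuple describes two equal-length subsequences with no gaps between them, which is exactly what the answer above calls a segment pair.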
UNIT-XI
GENOME ANALYSIS
OBJECTIVE
STRUCTURE:
11.1 Introduction
11.2 The Human Genome Project HGP
11.3 Yeast Genome Database
11.4 BACs
11.5 OMIM
11.6 KEGG Database
11.7 TIGR
11.8 MBGD
11.9 Summary
11.10 Key words
11.11 Answer to check-up your Progress
11.12 Self assessment Questions
11.13 Further Readings.
11.1 INTRODUCTION
This chapter introduces the history, goals and accomplishments of the Human Genome
Project and a range of specialist genome information resources such as KEGG, TIGR,
MBGD, OMIM, YACs and BACs.
Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort
coordinated by the U.S. Department of Energy and the National Institutes of Health. The
project originally was planned to last 15 years, but rapid technological advances
accelerated the completion date to 2003. Project goals were to
To help achieve these goals, researchers also studied the genetic makeup of several
nonhuman organisms. These include the common human gut bacterium Escherichia coli,
the fruit fly, and the laboratory mouse.
A unique aspect of the U.S. Human Genome Project is that it was the first large
scientific undertaking to address the potential ethical, legal and social issues (ELSI)
arising from project data. Sequence and analysis of the human genome working draft were
published in the February 2001 and April 2003 issues of Nature and Science.
Human Genome Project Goals and Completion Dates
11.3 YEAST GENOME DATABASE
The yeast Saccharomyces cerevisiae is clearly an ideal eukaryotic microorganism
for biological studies. The awesome power of yeast genetics has become legendary and
is the envy of those who work with higher eukaryotes. The complete sequence of its
genome has proved to be extremely useful as a reference for the sequences of
human and other higher eukaryotic genes. Furthermore, the ease of genetic manipulation
of yeast allows its use for conveniently analyzing and functionally dissecting gene
products from other eukaryotes.
Genomic analysis
Many diverse studies require the determination of the abundance of large numbers of
specific DNA or RNA molecules in complex mixtures, including, for example, the
determination of the changes in mRNA levels of many genes. While a number of
techniques have been used to estimate the relative abundance of two or more sets of
mRNA, such as differential screening of cDNA libraries, subtractive hybridization, and
differential display, far superior methods have recently been developed that are
particularly amenable to organisms whose entire genome sequences are known, such as
S. cerevisiae. It is now practicable to investigate changes of mRNA levels of all yeast
ORFs in one experiment.
The following procedures have been successfully used for determining mRNA levels in
yeast: (i) the DNA Microarray System; (ii) the Oligonucleotide Microarray System; (iii)
the Low-density DNA Array System; and (iv) the kRT-PCR System.
Genetic determinant      Chromosomes          2-µm plasmid         Mitochondrial DNA    RNA viruses (L-A, M, L-BC, T, W)
Inheritance              Mendelian            Non-Mendelian        Non-Mendelian        Non-Mendelian
Nucleic acid             Double-stranded DNA  Double-stranded DNA  Double-stranded DNA  Double-stranded RNA
Location                 Nucleus              Nucleus              Cytoplasm            Cytoplasm
Relative amount          85%                  5%                   10%                  80%, 10%, 9%, 0.5%, 0.5%
Number of copies         2 sets of 16         60-100               ~50 (8-130)          10^3, 170, 150, 10, 10
Size (kb)                13,500 (200-2,200)   6.318                70-76                4.576, 1.8, 4.6, 2.7, 2.25
Deficiencies in mutants  All kinds            None                 Cytochromes          Killer toxin; none
Wild-type                YFG1+                cir+                 ρ+                   KIL-k1
Mutant or variant        yfg1-1               cir0                 ρ-                   KIL-0
The accessibility of the yeast genome for genetic manipulation and the available
techniques to introduce exogenous DNA into yeast cells has led to the development of
methods for analyzing and preparing DNA and proteins not only from yeast itself, but
also from other organisms. For example, many mammalian homologs of yeast genes have
been cloned by using heterologous cDNA expression libraries in yeast expression vectors.
Also, yeast is being used to investigate the detailed functions of heterologous proteins,
such as mammalian transcription factors and nuclear hormone receptors. In fact, like
E.coli, yeast has become a standard microorganism for carrying out special tasks, some of
which are described in this section.
A yeast artificial chromosome (YAC) is a vector used to clone large DNA
fragments (larger than 100 kb and up to 3000 kb). It is an artificially constructed
chromosome and contains the telomeric, centromeric, and replication origin sequences
needed for replication and preservation in yeast cells. YACs are built from an initially
circular plasmid that is linearised using restriction enzymes; DNA ligase can then add a
sequence or gene of interest within the linear molecule by the use of cohesive ends. They
were first described in 1983 by Murray and Szostak.
YACs are extremely useful because yeasts are themselves eukaryotic cells, so one can
obtain eukaryotic protein products with posttranslational modifications. However, YACs
have been found to be more unstable than BACs, producing chimeric effects. Before the
advent of the Human Genome Project, YACs and BACs were used to map sections of
DNA of interest when hunting for specific genes.
What are all the various procedures used for determining mRNA levels in yeast?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
BACs are often used to sequence the genetic code of organisms in genome projects, for
example the Human Genome Project. A short piece of the organism's DNA is amplified
as an insert in BACs, and then sequenced. Finally, the sequenced parts are rearranged in
silico, resulting in the genomic sequence of the organism.
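The in silico rearrangement step can be pictured with a toy greedy assembler that repeatedly merges the two fragments sharing the longest suffix-prefix overlap. Real assembly software also has to cope with sequencing errors, repeats and reverse-complement reads; none of that is handled here. The function names and the reads are invented for the example.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break                             # no overlaps left: separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))   # ['ATGGCGTACGTTA']
```

Each merged string corresponds to a contig: a set of overlapping sequenced pieces presented as one assembly.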
BACs are now being utilized to a greater extent in modeling genetic diseases, often
alongside transgenic mice. BACs have been useful in this field because complex genes may
have several regulatory sequences upstream of the encoding sequence, including various
promoter sequences that govern a gene's expression level. BACs have been used with
some degree of success in mice when studying neurological diseases such as
Alzheimer's disease, or in the case of aneuploidy associated with Down syndrome.
There have also been instances when they have been used to study specific oncogenes
associated with cancers. They are transferred into these genetic disease models by
electroporation/transformation, transfection with a suitable virus, or microinjection. BACs
can also be utilized to detect genes or large sequences of interest, which are then mapped
onto the human chromosome using BAC arrays. BACs are preferred for these kinds
of genetic studies because they accommodate much larger sequences without the risk of
rearrangement, and are therefore more stable than other types of cloning vectors.
The Mendelian Inheritance in Man Project is a database that catalogues all the known
diseases with a genetic component and, when possible, links them to the relevant genes
in the human genome and provides references for further research and tools for genomic
analysis of a catalogued gene.
Versions
It is available as a book named after the project, which is currently in its 12th edition. The
online version is called Online Mendelian Inheritance in Man (OMIM), which can
be accessed with the Entrez database searcher of the National Library of Medicine and is
part of the NCBI project Education.
Collection Process
The information in this database is collected and processed under the leadership of
Dr. Victor A. McKusick at Johns Hopkins University, assisted by a team of science writers
and editors. Relevant articles are identified, discussed and written up in the relevant
entries in the MIM database.
Every disease and gene is assigned a six-digit number, of which the first digit classifies
the method of inheritance.
First digit   Range of MIM codes   Method of inheritance
1             100000-199999        Autosomal dominant loci or phenotypes (created before May 15, 1994)
2             200000-299999        Autosomal recessive loci or phenotypes (created before May 15, 1994)
3             300000-399999        X-linked loci or phenotypes
4             400000-499999        Y-linked loci or phenotypes
5             500000-599999        Mitochondrial loci or phenotypes
6             600000-              Autosomal loci or phenotypes (created after May 15, 1994)
Symbols:
A number symbol (#) before an entry number indicates that it is a descriptive entry,
usually of a phenotype.
A plus sign (+) before an entry number indicates that the entry contains the description of
a gene of known sequence and a phenotype.
A percent sign (%) before an entry number indicates that the entry describes a confirmed
mendelian phenotype or phenotypic locus for which the underlying molecular basis is not
known.
A caret symbol (^) before an entry number means the entry no longer exists because it
was removed from the database or moved to another entry as indicated.
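The numbering scheme and prefix symbols described above lend themselves to a small lookup sketch. The dictionaries simply restate the table and symbol list from this section; the function name describe_mim and the entry numbers used are made up for illustration and are not real OMIM entries.

```python
INHERITANCE_BY_FIRST_DIGIT = {
    "1": "autosomal dominant (created before May 15, 1994)",
    "2": "autosomal recessive (created before May 15, 1994)",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
    "6": "autosomal (created after May 15, 1994)",
}

PREFIX_MEANING = {
    "#": "descriptive entry, usually of a phenotype",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed Mendelian phenotype or locus, molecular basis not known",
    "^": "entry removed from the database or moved to another entry",
}

def describe_mim(entry: str):
    """Split an entry such as '#123456' into (prefix meaning, inheritance class)."""
    prefix = entry[:1] if entry[:1] in PREFIX_MEANING else ""
    number = entry[len(prefix):]
    if len(number) != 6 or not number.isdigit():
        raise ValueError(f"not a six-digit MIM number: {entry!r}")
    return PREFIX_MEANING.get(prefix), INHERITANCE_BY_FIRST_DIGIT.get(number[0])

print(describe_mim("#123456"))
print(describe_mim("654321"))
```

An entry without a prefix returns None for the prefix meaning, and a first digit outside the table returns None for the inheritance class.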
Check Your Progress 3:
The KEGG (Kyoto Encyclopedia of Genes and Genomes) database was initiated by the
Japanese human genome programme in 1995. Its developers consider KEGG to be a
computer representation of the biological system. The KEGG database can be utilized
for modeling and simulation, and for browsing and retrieval of data. It is a part of the
systems biology approach.
* KEGG Pathway
TIGR's Genome Projects are a collection of curated databases containing DNA and
protein sequence, gene expression, cellular role, protein family, and taxonomic data for
microbes, plants and humans. Access to the data is facilitated by TIGR's Internet 2
high-speed research network connection, which is supported in part by the National
Science Foundation under grant ANI-0333537. Anonymous FTP access to sequence data
is also provided. We can also access the following in this database.
* KEGG Genes
* KEGG Ligand
* KEGG BRITE
Databases
KEGG Pathways:
* Metabolism
* Genetic Information Processing
* Environmental Information Processing
* Cellular Processes
* Human Diseases
* Drug Development
Ligand Database:
* Compound
* Drug
* Glycan
* Reaction
* RPAIR
* Enzyme
Plant Genomics
The TIGR Castor Bean Database provides links to the castor bean genome project at
TIGR and includes sequencing and assembly of a 4X draft of the ~400 Mbp genome
using a whole genome shotgun strategy, and ~50,000 ESTs from different tissues to aid in
gene discovery and annotation. This project is funded by the NIAID-NIH, through the
Microbial Genome Sequencing Center at TIGR.
TIGR Plant Transcript Assemblies represent clustered assemblies of all transcripts for
~140 plant species.
The TIGR Arabidopsis thaliana Database provides access to genomic sequence data
and annotation generated at TIGR and assemblies of Arabidopsis ESTs from world-wide
sequencing projects.
Potato Functional Genomics Project provides links to the NSF-funded potato genome
project at TIGR and includes sequence data, annotation, and links to the Solanum
tuberosum Gene Index.
The TIGR Maize Database provides links to the NSF-funded Consortium for Maize
Genomics project and includes sequence, assembly and annotation data and links to the
Maize Gene Index.
TIGR Plant Repeat Databases is a collection of repetitive sequences for 12 plant genera
and four plant families.
The TIGR Loblolly Pine Functional Genomics Project is carried out in collaboration with
the Institute of Paper Science and Technology and is funded by the National Science
Foundation.
Maize Oligonucleotide Array Project produces and distributes to the research community
high-density microarrays for the maize genome. The project site will contain information on
project goals, participants, array availability, and data access.
Rice Oligonucleotide Array Project produces and distributes to the research community a
whole-genome microarray for the rice genome and links this information to the rice
genome.
What is TIGR?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Parasite Projects
The TIGR Parasites Database provides links to TIGR sequencing projects completed
and underway as well as links to related world-wide sequencing efforts.
The TIGR Tetrahymena thermophila Genome Database provides links to the NSF
and NIH-NIGMS funded Tetrahymena genome project at TIGR.
The TIGR Vector Genomics Database provides links to TIGR's sequencing efforts in
the area of Aedes aegypti vector genomics.
11.8 MBGD
More than 300 genomic sequences have been determined to date, and the number of
completed sequences continues to grow. Extracting useful information from such a
growing number of genomes is a major challenge in comparative genomics. Interestingly,
many of the completed genomic sequences are closely related to each other: of the 293
genomic sequences available at the end of 2005, the number of unique species (for which
at least one genome sequence was determined) is 211, and the number of unique genera is
only 135. It is important to conduct comparative analyses not only of distantly related
genomes, but also of closely related genomes, since we can extract different types of
information about biological functions and evolutionary processes from comparisons of
genomes at different evolutionary distances.
What is MBGD?
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
algorithm for constructing ortholog groups at the domain (rather than gene) level from
precomputed all-against-all similarity relationships. With this algorithm, MBGD not only
provides the orthologous groups among the latest genomic data available, but also allows
users to create their own ortholog groups using a specified set of organisms. The latter
feature is especially useful when the user's interest is focused on some taxonomically
related organisms; in fact, MBGD is most effectively used when an appropriate number
of genomes are selected. However, in the previous version, users could only choose
published genomes whose sequences were already available in MBGD.
With the growing amount of microbial genomic information in mind, we have
started a new service called My MBGD, which allows users to add their own genome
sequences to MBGD for the purpose of finding orthologous relationships among the
newly added genomes and the existing genomes. Furthermore, in order to facilitate
comparisons of closely related genomes, we have also enhanced the interface
of pairwise comparison using the CGAT interface, which is a Java applet for
displaying genome and alignment viewers.
MBGD Function
11.9 SUMMARY
Base pair:
A partial sequence of a clone, randomly selected from a cDNA library and used to
identify genes expressed in a particular tissue. ESTs are used extensively in projects to
map the human genome.
Flat file:
Genome:
All the genetic material in the chromosomes of a particular organism. Its size is
generally given as its total number of base pairs.
Genome Projects:
Initiatives to map and sequence the entire genomes of particular organisms. The first
complete eukaryotic genome to have been sequenced is that of the yeast S. cerevisiae.
Micro arrays:
MMDB:
NCBI:
The US National Center for Biotechnology Information.
NIH:
The US National Institutes of Health.
11.11 ANSWER TO CHECK-UP YOUR PROGRESS
2. The following procedures have been successfully used for determining mRNA levels in
yeast: (i) the DNA Microarray System; (ii) the Oligonucleotide Microarray System;
(iii) the Low-density DNA Array System; and (iv) the kRT-PCR System.
4. TIGR's Genome Projects are a collection of curated databases containing DNA and
protein sequence, gene expression, cellular role, protein family, and taxonomic data for
microbes, plants and humans.
1. Introduction to Bioinformatics - Attwood.
2. Bioinformatics: Sequence and Genome Analysis - David W. Mount.
3. Bioinformatics: Methods and Applications - S.C. Rastogi.
4. Bioinformatics - C.S.V. Murthy.
[Figure: MBGD functions. Panel labels: organism selection, parameter setting, homology relationships, ortholog cluster table.]