
UNIT-IX
INFORMATION RESOURCES
OBJECTIVE
After going through this unit, you should be able to understand various
biological databases, such as:
Protein sequence databases.
Nucleotide databases.
Structural databases.
Specialized databases.
STRUCTURE:
9.0 Learning Objective

9.1 Introduction
9.2 Biological Databases
9.3 Classification of Biological Databases
9.3.1 Classification of Biological Databases
(based on the Bio-molecules Analysis)
9.4 Primary sequence Databases.
9.5 Protein sequence Databases
9.6 Composite Databases
9.7 Secondary Database (or) Pattern Database
9.8 Structural classification Database
9.9 Specialised Databases
9.10 Summary
9.11 Key words
9.12 Answer to check your Progress
9.13 Terminal Questions
9.14 Further Readings
9.1 INTRODUCTION
The aim of this chapter is to provide an introduction to a range of biological
databases, highlighting the distinction between different data types and indicating where
some of the most important resources are maintained. The chapter discusses primary
sequence databases, composite sequence databases and a variety of secondary pattern
databases. Two structure classification resources are also briefly mentioned.
9.2 BIOLOGICAL DATABASES
Biological databases are computer sites that organise, store and disseminate files
that contain information consisting of literature references, nucleic acid sequences, protein
sequences and protein structures.
Databases are effectively electronic filing cabinets, a convenient and efficient
method of storing vast amounts of information.
9.3 CLASSIFICATION OF BIOLOGICAL DATABASES:
There are many different database types. They can be distinguished by:
1) The nature of the information being stored: sequences and structures, 2D gel
or 3D structure images.
2) The manner of data storage: flat files, tables in a relational database, or
objects in an object-oriented database.
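The difference between flat-file and relational storage can be illustrated with a small sketch. The record layout below (two-letter line codes, `//` as an end marker) and the table and column names are assumptions made up for this illustration; they do not reproduce the exact format of any particular database.

```python
import sqlite3

# Hypothetical flat-file record: a two-letter line code followed by a value.
FLAT_RECORD = """ID   EXAMPLE1
OS   Homo sapiens
SQ   AGGVLIIQVG
//"""


def parse_flat_record(text):
    """Turn one flat-file record into a dict keyed by line code."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):  # assumed end-of-record marker
            break
        code, value = line[:2], line[2:].strip()
        record[code] = value
    return record


def store_relational(record):
    """Store the same record as one row of an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.execute(
        "INSERT INTO entry VALUES (?, ?, ?)",
        (record["ID"], record["OS"], record["SQ"]),
    )
    conn.commit()
    return conn


record = parse_flat_record(FLAT_RECORD)
conn = store_relational(record)
row = conn.execute(
    "SELECT organism FROM entry WHERE id = ?", ("EXAMPLE1",)
).fetchone()
print(row[0])
```

In the flat file the whole record must be read and parsed before any field can be used, whereas the relational table can be queried field by field.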
9.3.1 CLASSIFICATION OF BIOLOGICAL DATABASES:
Based on protein sequence analysis and nucleotide sequence analysis,
biological databases are classified into three types:
- Primary Databases
- Composite Databases
- Secondary Databases
These are used to address different aspects of sequence analysis, because they
store different levels of protein sequence information.
9.4 PRIMARY SEQUENCE DATABASES:
In the early 1980s, sequence information started to become more abundant in the
scientific literature. Realising this, several laboratories saw that there might be an advantage
in harvesting and storing these sequences in central repositories. Thus several primary
database projects began to evolve in different parts of the world.
A primary database is therefore a database that stores biomolecular sequences
(protein or nucleic acid) and associated annotation information (organism, species,
function, mutations linked to particular diseases, functional/structural patterns,
bibliographic references, etc.).
a) Nucleotide Databases:
The principal DNA sequence databases are
1. GenBank (USA)
2. EMBL (Europe)
3. DDBJ (Japan)
They exchange data on a daily basis to ensure comprehensive coverage at each of
the sites.
1. GenBank
GenBank is the DNA database from the National Center for Biotechnology
Information (NCBI). NCBI is a division of the National Library of Medicine, located at
the National Institutes of Health (NIH) in Bethesda, Maryland.
As of Release 127.0 on 15th December 2001, there were approximately 15,850
million bases in around 15 million sequence records.
2. EMBL
EMBL is the DNA database maintained by the European Molecular
Biology Laboratory, or more specifically by the European Bioinformatics Institute (EBI) at
Hinxton Hall, UK. As of release 69 on 1st December 2001, it had 14.4 million sequences.
3. DDBJ
DDBJ, the DNA Data Bank of Japan, began in 1986 and is produced, maintained and
distributed at the National Institute of Genetics in Mishima, Japan. It had 15
million sequences according to release 48 in January 2002.
9.5 PROTEIN SEQUENCE DATABASES
Each of the public protein sequence databases has its own distinctive features. The major
protein sequence databases are:
- PIR (Protein Information Resource)
- MIPS
- SWISS-PROT
- TrEMBL (Translated EMBL)
- NRL-3D
1. PIR (Protein Information Resource)
PIR was developed at the National Biomedical Research Foundation in the early
1960s by Margaret Dayhoff as a collection of sequences for investigating evolutionary
relationships among proteins.
In its current form, the database is split into 4 sections, PIR 1 to PIR 4.
PIR 1 contains fully classified and annotated entries.
PIR 2 includes preliminary entries.
PIR 3 contains unverified entries.
PIR 4 entries fall into the following categories:
- Conceptual translations of sequences
- Protein sequences
- Conceptual translations of artefactual sequences
- Sequences that are not genetically encoded and not produced on ribosomes.
2. MIPS (Martinsried Institute for Protein Sequences)
MIPS collects and processes sequence data for the tripartite PIR-International
Protein Sequence Database project.
3. SWISS-PROT
SWISS-PROT is a protein sequence database established in 1986. It was produced
by the Department of Medical Biochemistry at the University of Geneva and the EMBL.
The database provides
- High-level annotations.
- Descriptions of the function of proteins.
- The structure of their domains.
- Their post-translational modifications.
- Variants, etc.
4. TrEMBL (Translated EMBL)
TrEMBL was created in 1996 as a computer-annotated supplement to SWISS-PROT.
The database follows the SWISS-PROT format and contains translations of all
coding sequences (CDS) in EMBL.
It has two main sections, as follows:
(I) SP-TrEMBL (SWISS-PROT TrEMBL)
Contains entries that will eventually be incorporated into SWISS-PROT but
have not yet been manually annotated.
(II) REM-TrEMBL
Contains sequences that are not destined to be included in SWISS-PROT. These
include
- Immunoglobulins
- T-cell receptors
- Fragments of fewer than eight amino acids
- Synthetic sequences
- Patented sequences
- Codon translations (that do not encode real proteins)
5. NRL-3D
The NRL-3D database was produced by PIR from sequences extracted from the
Brookhaven Protein Data Bank (PDB). It provides
- Bibliographic references
- MEDLINE cross-references
- Secondary structure
- Active site
- Key words
- Binding site
- Modified site annotations
- Details of experimental method
- Resolution
- R factors, etc.
Check Your Progress1:
What are biological Databases?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Check Your Progress2:
Classify biological databases?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
9.6 COMPOSITE PROTEIN SEQUENCE DATABASES
A composite database is a database that amalgamates a variety of different primary
sources. Composite databases render sequence searching much more efficient, because they obviate the
need to interrogate multiple resources. The interrogation process is streamlined still
further if the composite has been designed to be non-redundant, which means that the same
sequence need not be searched more than once.
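The idea of a non-redundant composite can be sketched in a few lines. This is only an illustration under simple assumptions: the source names and entry identifiers are invented, and only exact (case-insensitive) duplicates are removed, which is a cruder rule than the redundancy criteria used by real composite databases.

```python
def merge_non_redundant(sources):
    """Merge sources in priority order, keeping each distinct sequence once.

    `sources` is a list of (source_name, {entry_id: sequence}) pairs.
    Earlier sources take priority when the same sequence appears twice.
    """
    seen = {}  # sequence -> (source_name, entry_id) of the copy that was kept
    for name, entries in sources:
        for entry_id, seq in entries.items():
            key = seq.upper()
            if key not in seen:  # exact duplicates from later sources are dropped
                seen[key] = (name, entry_id)
    return seen


# Hypothetical primary sources, listed in priority order.
sources = [
    ("source_A", {"A1": "AGGVLIIQVG", "A2": "MKTAYIAK"}),
    ("source_B", {"B1": "aggvliiqvg", "B2": "AGGVLIQVG"}),
]
composite = merge_non_redundant(sources)
print(len(composite))  # 3 distinct sequences remain out of 4 entries
```

Note that `AGGVLIQVG` is kept although it differs from `AGGVLIIQVG` by a single residue; a stricter composite would need an additional similarity criterion to remove such near-duplicates.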
The main composite databases are:
I) NRDB - a non-identical composite protein sequence database
II) OWL - a non-redundant composite
1) NRDB (Non-Redundant Database)
It is built at NCBI. Strictly speaking, it is non-identical rather than non-redundant. This simplistic
approach leads to a number of problems.
- Multiple copies of the same protein are retained in the database as a result
of polymorphisms and/or minor sequencing errors.
- Incorrect sequences that have been amended in SWISS-PROT are
reintroduced when translated from the DNA.
- Numerous sequences are incorporated as full entries of existing fragments.
In view of this, the contents of NRDB are both error-prone and, in spite of
its name, redundant.
NRDB is the default database of the NCBI BLAST service.
2) OWL
OWL is a non-redundant database maintained at the University of Leeds in collaboration
with the Daresbury Laboratory in Warrington. The database is a composite of 4 major
primary sources:
- SWISS-PROT
- PIR 1-4
- GenBank
- NRL-3D
The process eliminates both identical copies of sequences and those containing
single amino acid differences. OWL is released on a 6-8 weekly basis.
Check Your Progress 3:
Define composite databases?

_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
9.7 SECONDARY DATABASE (OR) PATTERN DATABASE
Secondary databases are those that contain information derived from
primary sequence data, typically in the form of regular expressions (patterns),
fingerprints, blocks, profiles or Hidden Markov Models.
The type of information stored in each of the secondary databases is different,
but these resources all arise from a common principle:
- Homologous sequences may be gathered together in multiple alignments.
- Conserved regions (or motifs) are found within these alignments.
- Conserved regions usually reflect some vital biological role (i.e. they are somehow crucial to the
structure or function of the protein).
Some of the major secondary databases are
- PROSITE
- PRINTS
- BLOCKS
- IDENTIFY
- PROFILES
- Pfam
PROSITE:
The first secondary database to be developed was PROSITE, which is maintained at the
Swiss Institute of Bioinformatics. The rationale behind its development was that protein
families could be simply and effectively characterized by the single most conserved motif
observable in a multiple alignment of known homologues. Such motifs usually encode key
biological functions, e.g. enzyme active sites and ligand or metal binding sites.
Within PROSITE, motifs are encoded
as regular expressions, often referred to as patterns. Sometimes a complete protein family
cannot be characterized effectively by a single motif. In these cases additional patterns
are designed to encode other well-conserved parts of the alignment.
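Because PROSITE patterns are regular expressions written in their own notation, they can be translated into the regular expression syntax of a programming language and used to scan sequences. The sketch below handles only the basic elements of the notation (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name is hypothetical, and the pattern used is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


# N-glycosylation site: Asn, anything but Pro, Ser or Thr, anything but Pro.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)  # N[^P][ST][^P]

match = re.search(regex, "MKNGSAL")
print(match.group(), match.start())  # NGSA 2
```

A short pattern like this one matches many unrelated sequences by chance, which is one reason additional patterns or other secondary database methods are used for families that a single motif cannot characterize.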
1. PRINTS
PRINTS is a database of protein motifs. It uses fingerprints, composed of more than
one motif, to characterise an entire protein sequence. Groups of motifs found in a
sequence family can define a signature for that family. It is maintained in the Department
of Biochemistry and Molecular Biology at University College London (UCL).
Within PRINTS, motifs are encoded as ungapped, unweighted local alignments.
PRINTS provides the raw material for automatically derived tertiary databases.
2. Blocks
BLOCKS is a multiple-motif database. Motifs, or blocks, are created by
automatically detecting the most highly conserved regions of each protein family. The BLOCKS
database contains more than 4000 entries.
3. Identify
Another automatically derived tertiary resource, derived from BLOCKS and
PRINTS, is IDENTIFY. It is produced in the Department of Biochemistry at Stanford
University. The program used to generate it is eMOTIF, which is based on the generation of
consensus expressions from conserved regions of sequence alignments.
4. Profiles
A position-specific scoring table that encapsulates the sequence information within
a complete alignment is termed a profile. Profiles define which residues are allowed at
given positions, which positions are conserved and which degenerate, and which positions
or regions can tolerate insertions.
The principle is that the variable regions between conserved motifs also contain
valuable sequence information, so the complete sequence alignment effectively becomes the
discriminator. The profile is weighted to indicate where insertions and deletions (INDELs) are
allowed. Profiles are hence also called weight matrices.
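A position-specific scoring table can be sketched as follows. This is a deliberately reduced illustration: it assumes an ungapped alignment, a uniform background frequency for the 20 amino acids and a simple pseudocount, and it leaves out the position-specific insertion and deletion weights that full profiles carry. The motif sequences are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for a sequence window of the same length."""
    return sum(column[aa] for column, aa in zip(profile, window))


# Hypothetical ungapped alignment of a short conserved motif.
motif = ["GDSGG", "GDSGG", "GDAGG", "GDSGA"]
profile = build_profile(motif)
print(round(score_window(profile, "GDSGG"), 2))
print(round(score_window(profile, "AAAAA"), 2))
```

A window that resembles the aligned family scores higher than an unrelated one, which is how the table acts as a discriminator. Conserved columns contribute large scores for the expected residue, while degenerate columns contribute little.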
5. Pfam
Pfam is a database of alignments of protein domain families. The database is based
on two distinct classes of alignment:
i) Hand-edited seed alignments. These are accurate and are used to produce Pfam-A.
ii) Those derived by automatic clustering of SWISS-PROT. These are less reliable
and are used to produce Pfam-B.
All sequences that are not included in Pfam-A are automatically clustered and
deposited in Pfam-B.
Check Your Progress 4:
Mention a few secondary databases.
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
9.8 STRUCTURE CLASSIFICATION DATABASES
The most important structural classification databases are
i) SCOP
ii) CATH
1. SCOP (Structural Classification of Proteins)
SCOP, maintained at the MRC Laboratory of Molecular Biology and Centre for
Protein Engineering, describes structural and evolutionary relationships between proteins
of known structure.
SCOP has been constructed using a combination of manual inspection and
automated methods.
The SCOP classification consists of (1) family, (2) superfamily and (3) fold of the
protein. SCOP is accessible for keyword interrogation via the MRC Laboratory
Web server.
2. CATH (Class, Architecture, Topology, Homology database)
CATH is maintained at UCL. The resource is largely derived using automatic
methods, but manual inspection is necessary where automatic methods fail.
There are 5 levels within this hierarchy. They are
1. Class
2. Architecture
3. Topology
4. Homology
5. Sequence.
9.9 SPECIALISED DATABASES
There are many more databases, some of which provide very specialized
information. They include:
1. GeneCards
GeneCards is a database of human genes, their products and their involvement in
diseases. It offers concise information about each gene, listed under its approved symbol.
2. KEGG (Kyoto Encyclopedia of Genes and Genomes)
KEGG is an effort to computerize current knowledge of molecular and cellular biology
in terms of information pathways that consist of interacting molecules or genes, and to
provide links from the gene catalogues produced by genome sequencing projects.
3. SGD (Saccharomyces Genome Database)
SGD is a resource that brings together information on the molecular
biology and genetics of S. cerevisiae.
4. UniGene
UniGene provides a transcript map by utilizing sets of non-redundant gene-oriented
clusters derived from GenBank sequences. The collection represents genes from
many organisms, each cluster relating to a unique gene and including related information,
such as the tissue type in which the gene is expressed, map location, etc.
5. TDB (TIGR Database)
This database contains DNA and protein sequence, gene expression, cellular role and
protein family information, and taxonomic data for microbes, plants and humans.
6. ACeDB (A Caenorhabditis elegans Database)
ACeDB arose from the C. elegans genome project. It includes restriction maps, gene
structural information, cosmid maps, sequence data and bibliographic references. It enables
the user to view genomic data at different levels of resolution, from the level of a
complete chromosome down to the physical level.
Check Your Progress 5:
What is KEGG?
9.10 SUMMARY:
* Databases are used to store the vast amounts of information issuing from the
genome projects.
* Primary databases contain sequence data (nucleic acid or protein).
* Composite databases amalgamate a variety of different primary sources and are
hence efficient to search.
* Secondary databases contain pattern data, i.e. diagnostic signatures for protein
families.
9.11 KEY WORDS:

Amino acid:
Fundamental building blocks of proteins. There are 20 naturally occurring amino
acids in animals and around 100 more found only in plants.
Bio-Informatics:
Bio-Informatics is the application of computational techniques to the management
and analysis of biological information.
Block:
An ungapped, aligned motif consisting of sequence segments that are clustered to
reduce multiple contributions from groups of highly similar or identical sequences.
Composite database:
A database that amalgamates a number of primary sources, using a set of defined
criteria that determine the priority of inclusion of the different sources and the level of
redundancy retained.
Conceptual Translation:
The computational process of interpreting the sequence of nucleotides in mRNA
via the genetic code to a sequence of amino acids, which may or may not code for
protein.
Contig:
Sequences of clones, representing overlapping regions of a gene, presented as an
assembly or multiple alignment.
Databases:
Collections of data in machine-readable form, which can be manipulated by
software to appear in varying arrangements and subsets.
Nucleotide:
A molecule consisting of a nitrogenous base (A, G, T or C in DNA; A, G, U or C in
RNA), a phosphate moiety and a sugar group (deoxyribose in DNA and
ribose in RNA). Thousands of nucleotides are linked to form a DNA or RNA molecule.
Primary Database:
A database that stores biomolecular sequences (Proteins or Nucleic acid) and
associated annotation information (organism, species, function, mutations linked to
particular diseases, functional/structural patterns bibliographic etc).
Protein:
A molecule consisting of one or more chains of amino acids in a specific order. The
order is determined by the base sequence of nucleotides in the gene coding for the
protein.
PIR:
A database of translated GenBank nucleotide sequences. PIR is a redundant protein
sequence database. The database is divided into 4 categories:
PIR-1 - Classified and annotated
PIR-2 - Annotated
PIR-3 - Unverified
PIR-4 - Unencoded/untranslated.
Relational database:
A database that uses a relational data model, in which data are stored in 2-
dimensional tables. The tables embody different aspects or properties of data, but contain
overlapping information.
Secondary database:
A database that contains information derived from primary sequence data,
typically in the form of regular expressions (patterns), fingerprints, blocks, profiles or
Hidden Markov Models. These abstractions represent distillations of the most conserved
features of multiple alignments, such that they are able to provide potent discriminators
of family membership for newly determined sequences.
Single Nucleotide Polymorphisms (SNPs):
SNPs are defined as single base-pair positions in genomic DNA that vary among
individuals in one or several populations.
Swiss-Prot:
A non-redundant protein sequence database, thoroughly annotated and cross-referenced.
A subdivision is TrEMBL.
9.12 ANSWERS TO CHECK YOUR PROGRESS:
(1) Biological databases are computer sites that organize, store and disseminate files that
contain information consisting of literature references, nucleic acid sequences, protein
sequences and protein structures.
(2) They are
- Primary Databases
- Composite Databases
- Secondary Databases
(3) A composite database is a database that amalgamates a variety of different primary
sources. Composite databases render sequence searching much more efficient, because they obviate the
need to interrogate multiple resources.
(4) Secondary databases are those that contain information derived from primary
sequence data, typically in the form of regular expressions (patterns), fingerprints,
blocks, profiles or Hidden Markov Models.
(5) KEGG is an effort to computerize current knowledge of molecular and cellular biology in
terms of information pathways that consist of interacting molecules or genes, and to
provide links from the gene catalogues produced by genome sequencing projects.
9.13 SELF ASSESSMENT QUESTIONS:
1. What are primary databases? Explain in detail.
2. Briefly explain the various structural classification databases.
9.14 FURTHER READINGS:
1. Introduction to Bioinformatics, Attwood.
2. Bioinformatics: Sequence and Genome Analysis, David W. Mount.
3. Bioinformatics: Methods and Applications, S.C. Rastogi.
4. Bioinformatics, C.S.V. Murthy.
UNIT-X
SEQUENCE ANALYSIS
OBJECTIVE
After going through this unit, you should be able to understand
Sequence similarity searches.
Pairwise sequence alignment.
Multiple sequence alignment.
Scoring matrices.
Various tools such as FASTA, BLAST and Clustal W.
STRUCTURE:
10.1 Introduction
10.2 Sequence similarity searches Pairwise Alignment Techniques.
10.3 Scoring matrices
10.3.1 Dayhoff Mutation Data Matrix.
10.3.2 The BLOSUM Matrices.
10.4 Dynamic Programming
10.5 Comparative Analysis by Pairwise Alignment
10.6 Tools
10.6.1 Fasta
10.6.2 Blast
10.7 Multiple Sequence Alignment
10.8 Multiple Alignment Tools.
10.9 Summary
10.10 Key words
10.11 Answer to checkup your Progress
10.12 Self Assessment Questions
10.13 Further Readings.
10.1 INTRODUCTION
This chapter introduces the concepts of sequence identity, similarity and
homology as they apply to the comparison of two sequences, be they protein, DNA or
RNA. Pairwise comparison is a fundamental process in sequence analysis. However,
analysis of groups of sequences that form gene families requires the ability to make
connections between more than 2 members of the group. Multiple sequence alignment
facilitates the elucidation of biologically significant motifs.
10.2 SEQUENCE SIMILARITY SEARCHES: PAIRWISE ALIGNMENT
TECHNIQUES:
Pairwise comparison is a fundamental process in sequence analysis. It finds
relationships based on sequence properties rather than on simple interrogation of textual
annotation.
In order to identify an evolutionary relationship between a newly determined
sequence and a known gene family, we need to assess the extent of shared similarity.
Family relationships are important because they allow us to find some order in the apparent
chaos that constitutes the genome.
Comparing two sequences:
Line up the sequences against each other and insert additional characters to bring
the two strings into vertical alignment.
Unaligned
Sequence 1 (Query)   AGGVLIIQVG
Sequence 2 (Subject) AGGVLIQVG
Aligned
Sequence 1 (Query)   AGGVLIIQVG
Sequence 2 (Subject) AGGVLI-QVG
We can score an alignment by counting how many positions match identically.
Here the unaligned score is 6 and the aligned score is 9.
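The identity count used in this example can be checked with a few lines of code. The function name is hypothetical, and the gap character `-` is an assumed convention for the inserted character.

```python
def count_identities(seq1, seq2):
    """Count positions at which two sequences carry the same residue."""
    return sum(1 for a, b in zip(seq1, seq2) if a == b and a != "-")


# Unaligned: the sequences are compared position by position as they stand.
print(count_identities("AGGVLIIQVG", "AGGVLIQVG"))   # 6

# Aligned: one gap in the subject restores the vertical register.
print(count_identities("AGGVLIIQVG", "AGGVLI-QVG"))  # 9
```

A single inserted gap character raises the number of identical positions from 6 to 9, because the last three residues of the two sequences are brought back into line.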
The quality of an alignment can be measured in terms of the number of gaps introduced and
the number of mismatches remaining in the alignment. A metric relating such parameters
represents the distance between two sequences.
We could simply maximise the number of identical matches by inserting gaps in
an unrestricted manner. To prevent this, a scoring penalty is introduced for each gap that is opened, to minimize the number of gaps,
and an extension penalty is introduced when a gap has to be extended. The total
alignment score is then a function of the identity between aligned residues and the gap
penalties incurred.
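A minimal sketch of such a score for an alignment that has already been made is given below. The numerical values (1 per identical pair, 2 for opening a gap, 1 for each extension) are arbitrary assumptions chosen for illustration, and as a simplification a gap in one sequence followed immediately by a gap in the other is treated as a single gap.

```python
def alignment_score(a, b, match=1, mismatch=0, gap_open=2, gap_extend=1):
    """Score two aligned rows: identities minus gap opening and extension penalties."""
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # The first gap character costs gap_open, each further one gap_extend.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


print(alignment_score("AGGVLIIQVG", "AGGVLI-QVG"))  # 9 identities - 2 = 7
print(alignment_score("AGGVLIIQVG", "AGGVL--QVG"))  # 8 identities - 3 = 5
```

With these penalties the alignment with a single gap scores higher than the one with a longer gap, so unrestricted gap insertion is no longer rewarded.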
10.3 SCORING MATRICES
Scoring matrices have been devised that weight matches between non-identical
residues. They are
1. The Dayhoff Mutation Data Matrix.
2. The BLOSUM matrices.
10.3.1. The Dayhoff Mutation Data Matrix
It is based on the concept of the Point Accepted Mutation (PAM). An evolutionary
distance of 250 PAMs gives similarity scores equivalent to 20% matches remaining
between two sequences. This is the Twilight Zone. Hence PAM 250 is often used as the
default matrix in comparison programs.
10.3.2. The BLOSUM Matrices (Blocks Substitution Matrix)
Matrices based on the Dayhoff model of evolutionary rates are of limited value
because their substitution rates are derived from alignments of sequences that are at least
85% identical. But the most common task in sequence analysis is the detection of more
distant relationships.
BLOSUM matrices overcome this limitation. They are a set of substitution matrices derived
from blocks of aligned sequences in the BLOCKS database.
Sequences clustered at greater than or equal to 80% identity are used to generate
the BLOSUM 80 matrix. Those in the 62% or greater cluster contribute to the BLOSUM
62 matrix.
Statistical values are used to indicate the level of confidence that should be
attached to an alignment. For pairwise alignments, these are usually formulated as
probability (P) values or expected frequency (E) values.
Alignments are models that reflect different biological perspectives. One model is
no more right or wrong than another.
Two general approaches consider similarity
a) Across the full extent of the sequences, which is called global alignment. It uses the
Needleman and Wunsch algorithm.
b) Across only parts of the sequences, which is called local alignment. It uses the
Smith-Waterman algorithm.
Both the Needleman and Wunsch and the Smith-Waterman algorithms exploit dynamic
programming.
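The local approach can be sketched as a short scoring routine in the style of the Smith-Waterman algorithm. This is a simplified illustration: it uses a plain match/mismatch score and a linear gap penalty (all values are assumptions) in place of a substitution matrix, and it returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


# The shared region AGGVL is found although the flanking residues differ.
print(smith_waterman_score("XXXAGGVLXXX", "YYAGGVLYY"))  # 10
```

Resetting negative cell values to zero is what distinguishes the local method from the global one: dissimilar flanking regions do not drag the score of the conserved region down.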
10.4 DYNAMIC PROGRAMMING
This is a programming technique in which we build a solution to a problem by
solving smaller, but similar, sub-problems.
Dynamic programming involves the technique of backtracking and testing
different paths to high-scoring alignments, guided by the various parameters (gap
penalties, etc.) available to the algorithm. The best of all paths is then selected as the final
alignment.
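The two stages described above, filling a table from smaller sub-problems and then backtracking along the best path, can be sketched for global alignment in the style of Needleman and Wunsch. The scoring values (+1 match, -1 mismatch, -1 per gap position) are assumptions for illustration; practical programs use a substitution matrix and separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap  # a prefix of a aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap  # a prefix of b aligned against gaps only
    # Fill: each cell is solved from three smaller sub-problems.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Backtrack from the bottom-right corner along one best-scoring path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, row1, row2 = needleman_wunsch("AGGVLIIQVG", "AGGVLIQVG")
print(score)
print(row1)
print(row2)
```

Several paths can share the best score. Here the gap may be placed opposite either of the two adjacent I residues, and the backtracking order decides which of the equally good alignments is reported.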
Check your progress 1:
What is pairwise sequence analysis?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Check your progress 2:
Mention a few scoring matrices used in sequence alignment techniques.

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
10.5 COMPARATIVE ANALYSIS BY PAIRWISE ALIGNMENT
Performing a comparison of one sequence against a database of many thousands is
an extension of pairwise alignment. Performing Needleman and Wunsch or Smith-Waterman
alignments is practicable for small numbers of sequences, but for larger
database searches these methods become time-consuming. Speed of execution is certainly
an issue for database searching. Speed depends on the length of the query sequence and on the
size of the database searched. The FASTA and BLAST programs are local similarity
search methods that concentrate on finding short identical matches, which may contribute
to a total match, using implementations that address issues of execution speed.
10.6 TOOLS
The FASTA and BLAST programs are local similarity search methods that
concentrate on finding short identical matches, which may contribute to a total match.
10.6.1. FASTA
The FASTA algorithm is based on the idea of identifying short words, or k-tuples,
common to both sequences under comparison. K-tuple sizes of 1 or 2 residues are used
in protein searches, while larger k-tuples (up to 6 bases) are used in DNA searches.
FASTA uses a heuristic approach to join k-tuples that lie close together on the
same diagonal. The regions formed in this way contain mismatches lying between
matching k-tuples. If a significant number of matches is found, FASTA uses a dynamic
programming algorithm to compute gapped alignments that incorporate the ungapped
regions.
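The first stage of this idea, finding shared k-tuples and grouping them by diagonal, can be sketched as follows. This is not the FASTA program itself, only an illustration of the lookup step; the function name and the definition of a diagonal as subject offset minus query offset are assumptions of the sketch.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count shared k-tuples on each diagonal (subject offset minus query offset)."""
    index = defaultdict(list)  # k-tuple -> positions at which it occurs in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


hits = ktuple_diagonals("AGGVLIIQVG", "AGGVLIQVG")
print(hits.most_common(2))  # [(0, 5), (-1, 3)]
```

The two well-populated diagonals correspond to the regions before and after the extra residue in the query. Joining them requires a gap, which is the task of the later dynamic programming stage.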
Check your progress 3:
What is BLAST?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
10.6.2. BLAST (Basic Local Alignment Search Tool)
BLAST has become hugely popular, because implementations of it have been very
efficient. The important concept in this algorithm is that of the segment pair. Given two
sequences, a segment pair is defined as a pair of subsequences of the same length that
form an ungapped alignment. BLAST calculates all segment pairs between the query and
the database sequences that score above a threshold. The algorithm searches for fixed-length
hits, which are then extended until certain threshold parameters are achieved. The
resulting High-scoring Segment Pairs (HSPs) form the basis of the ungapped alignments that
characterize BLAST output.
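The extension of a fixed-length hit into a segment pair can be sketched as below. This is a strongly simplified illustration and not the BLAST implementation: it assumes an exact word hit at known positions, a plain match/mismatch score in place of a substitution matrix, and a hypothetical `x_drop` parameter that stops the extension once the running score has fallen that far below the best score seen.

```python
def extend_hit(query, subject, qpos, spos, word=3, match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit without gaps; return (score, q_start, q_end)."""

    def extend(qi, si, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # the score has dropped too far below the best so far
            qi += step
            si += step
        return best, best_len

    right, right_len = extend(qpos + word, spos + word, +1)
    left, left_len = extend(qpos - 1, spos - 1, -1)
    score = word * match + left + right
    return score, qpos - left_len, qpos + word + right_len


query = "KKAGGVLIQWW"
subject = "PPAGGVLIQYY"
# The word GVL occurs at position 4 (counting from 0) in both sequences.
score, start, end = extend_hit(query, subject, 4, 4)
print(score, query[start:end])  # 7 AGGVLIQ
```

The extension stops on each side once two consecutive mismatches have lowered the score by the drop-off value, leaving the ungapped segment pair `AGGVLIQ`.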
Gapped BLAST
The BLAST algorithm has since been modified to generate gapped alignments. The new algorithm
seeks only one, rather than all, of the ungapped alignments that make up a significant match and
hence speeds up the initial database search. Dynamic programming is used to extend a
central pair of aligned residues in both directions to yield the final gapped alignment. As
it drops the requirement to find all ungapped alignments independently, the new
algorithm is 3 times faster than its predecessor.
10.7 MULTIPLE SEQUENCE ALIGNMENT
Analysis of groups of sequences that form gene families requires the ability to
make connections between more than two family members. Multiple alignments are used
to reveal conserved family characteristics. Multiple alignments are simply models. There
is nothing inherently correct or incorrect about a particular alignment. The important
point is whether the model accurately reflects known biological data. Sequence and
structure based alignments are both imperfect models, since neither can reflect all levels
of biological information. Both approaches are valid representations of particular aspects
of biology and neither should therefore be considered to represent some ultimate truth or
gold standard.
A multiple alignment can be defined as a 2D table in which the rows represent
individual sequences and the columns the residue positions. Sequences are laid into
this grid in such a manner that
a) the relative positioning of residues within any one sequence is preserved, and
b) similar residues in all the sequences are brought into vertical register.
We call the residue position in an unaligned sequence the absolute position, while the
aligned residue position is termed the relative position.
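The relation between absolute and relative positions can be shown with a small sketch. The function name, the 1-based numbering and the gap character `-` are assumptions of this illustration.

```python
def absolute_to_relative(aligned_row):
    """Map each residue's absolute position (1-based, gaps ignored) to its alignment column."""
    mapping = {}
    absolute = 0
    for column, char in enumerate(aligned_row, start=1):
        if char != "-":
            absolute += 1
            mapping[absolute] = column
    return mapping


positions = absolute_to_relative("AGGVLI-QVG")
print(positions[6])  # 6: residues before the gap keep their position
print(positions[7])  # 8: Q is residue 7 of the sequence but sits in column 8
```

Residues to the left of a gap have the same absolute and relative position, while every residue to the right is shifted by the number of gap characters inserted before it.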
The time taken to compute an alignment rises exponentially with the number of
sequences to be aligned. Various methods have been developed that use heuristics to
reduce the time to find good (not necessarily optimal) alignments. Some approaches
combine dynamic programming with heuristics. Such techniques include aligning all
pairs of sequences, aligning each sequence with one specific sequence, aligning
sequences in arbitrary order or aligning sequences following the branching order of a
Phylogenetic tree.
Manual methods are often dismissed as being subjective. However the results of
automatic alignment programs almost invariably require manual polishing, and hence
alignment editors have become essential tools.
Simultaneous multiple alignment methods align all sequences within a set at once,
and hence are very time consuming. They work best on small sets of short sequences.
Progressive multiple alignment methods align sequences in pairs, following the
branching order of a family tree. The best-known program is Clustal. Similar sequences are
aligned first and more distantly related sequences are added later. By exploiting likely
evolutionary relationships, such methods can handle more realistic data sets in a timely and
cost-effective manner.
10.8 MULTIPLE ALIGNMENT TOOLS
Clustal uses the positioning of gaps in closely related sequences to guide the insertion
of gaps into those that are more distant. Similarly, information compiled during
the alignment process about the variability of the most similar sequences is used to help
vary gap penalties on a residue- and position-specific basis.
There are numerous alignment databases accessible via the web. These result from
different approaches, e.g. the application of automated methods to cluster the primary
sequence resources into families, or endeavours to produce gene family
discriminators for inclusion in secondary databases.
Alignments produced by purely automatic methods should be handled with care,
especially in cases where sequence similarity is low. They often result in over-zealous
gap insertion and can produce misalignments.
Various computational techniques have evolved to search primary sequence databases
using alignment-based data structures. A recent hybrid approach, incorporating elements
of both pairwise and multiple alignment methods, is position-specific iterative BLAST, or
PSI-BLAST. Although fast to run, it has the disadvantage that the automated iterative search
may degenerate and lead to profile dilution.
Check your progress 4:
What is Clustal W?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
10.9 SUMMARY:
The simplest way to compare two sequences is to align them by inserting gap
characters to bring them into vertical register.
PAM 250 is often used as the default matrix in comparison programs.
BLOSUM matrices detect distant similarities more reliably than the Dayhoff
matrices.
Global alignments consider similarity across the full extent of sequences.
Local alignments consider similarity across only parts of the sequences.
The Needleman and Wunsch algorithm is used for global alignment.
The Smith-Waterman algorithm is used for local alignment.
FASTA and BLAST are local similarity search methods that concentrate on finding
short identical matches, which may contribute to a total match.
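The contrast between global and local alignment scoring summarised above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any published program: the function names and the simple match, mismatch and linear gap scores are chosen here for readability, whereas real tools score residue pairs with a substitution matrix such as PAM or BLOSUM.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: the best-scoring pair of sub-sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

With these assumed scores, two sequences that share only a short common region receive a low or negative global score but a positive local score, which is why local methods are preferred for database searching.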
10.10 KEY WORDS:
Algorithm:
A logical sequence of steps by which a task can be performed.
Alignment:
The result of a comparison of two or more gene or protein sequences in order to
determine their degree of base or amino acid similarity. Sequence alignments are used to
determine the similarity, homology, function or other degree of relatedness between two or
more genes or gene products.
Alignment score:
An algorithmically computed score based on the number of matches,
substitutions, insertions and deletions within an alignment. Scores for matches and
substitutions are derived from a scoring matrix such as the PAM and BLOSUM matrices
for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores
are in log-odds units, often bit units. Higher scores denote better alignments.
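As a rough illustration of this definition, the sketch below scores an alignment that has already been written out with "-" as the gap character. It is an assumption-laden toy rather than a standard routine: the function name, the flat match and mismatch values (standing in for a PAM or BLOSUM lookup) and the particular gap-open and gap-extend penalties are all hypothetical.

```python
def affine_alignment_score(top, bottom, match=2, mismatch=-1,
                           gap_open=-5, gap_extend=-1):
    """Score a finished pairwise alignment using affine gap penalties.

    The first position of each gap run costs gap_open; every further
    position in the same run costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in_top = gap_in_bottom = False  # are we inside a gap run?
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-":
            score += gap_extend if gap_in_top else gap_open
            gap_in_top, gap_in_bottom = True, False
        elif y == "-":
            score += gap_extend if gap_in_bottom else gap_open
            gap_in_top, gap_in_bottom = False, True
        else:
            score += match if x == y else mismatch
            gap_in_top = gap_in_bottom = False
    return score
```

Under these assumed values, one gap of length two costs 6 units while two separate single gaps cost 10, which is the behaviour affine penalties are meant to capture: opening a gap is expensive, lengthening it is cheap.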
10.11 ANSWER TO CHECK YOUR PROGRESS:
(1) Pairwise comparison is a fundamental process in sequence analysis. It identifies
relationships based on sequence properties rather than on simple interrogation of
textual annotation.
(2) 1. The Dayhoff Mutation Data Matrix.
2. The BLOSUM matrices.
(3) The important concept in the BLAST algorithm is that of the segment pair. Given two
sequences, a segment pair is defined as a pair of sub-sequences of the same length
that form an ungapped alignment. BLAST calculates all segment pairs between the
query and the database sequences that score above a threshold.
(4) Clustal W is a tool used for multiple sequence alignment.
10.12 SELF ASSESSMENT QUESTIONS:
1. How will you compare two sequences?
2. What is the importance of multiple sequence analysis?
10.13 FURTHER READINGS:
1. Introduction to Bioinformatics - Attwood.
2. Bioinformatics: Sequence and Genome Analysis - David W. Mount.
3. Bioinformatics: Methods and Applications - S.C. Rastogi.
4. Bioinformatics - C.S.V. Murthy.
UNIT-XI
GENOME ANALYSIS
OBJECTIVE
After going through this unit, you should be able to understand:
Human Genome Project
Yeast Genome database
BACs
OMIM
KEGG
TIGR
MBGD
STRUCTURE:
11.1 Introduction
11.2 The Human Genome Project HGP
11.3 Yeast Genome Database
11.4 BACs
11.5 OMIM
11.6 KEGG Database
11.7 TIGR
11.8 MBGD
11.9 Summary
11.10 Key words
11.11 Answers to Check Your Progress
11.12 Self assessment Questions
11.13 Further Readings.
11.1 INTRODUCTION
This chapter introduces the history, goals and accomplishments of the Human Genome
Project and a range of specialist genome information resources such as KEGG, TIGR,
MBGD, OMIM, YACs and BACs.
11.2 THE HUMAN GENOME PROJECT (HGP)
Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort
coordinated by the U.S. Department of Energy and the National Institutes of Health. The
project was originally planned to last 15 years, but rapid technological advances
brought the completion date forward to 2003. Project goals were to
Identify all the approximately 20,000 - 25,000 genes in human DNA,
Determine the sequences of the 3 billion chemical base pairs that make up
human DNA,
Store this information in databases,
Improve tools for data analysis,
Transfer related technologies to the private sector, and
Address the ethical, legal, and social issues (ELSI) that may arise from the
project.
To help achieve these goals, researchers also studied the genetic makeup of several
nonhuman organisms. These include the common human gut bacterium Escherichia coli,
the fruit fly, and the laboratory mouse.
A unique aspect of the U.S. Human Genome Project is that it was the first large
scientific undertaking to address potential ELSI implications arising from project data.
Papers describing the sequence and analysis of the human genome, beginning with the
working draft, were published in the February 2001 and April 2003 issues of Nature and Science.
Human Genome Project Goals and Completion Dates
Area                  HGP Goal                          Standard Achieved                     Date Achieved
Genetic Map           2- to 5-cM resolution map         1-cM resolution map                   September 1994
                      (600-1,500 markers)               (3,000 markers)
Physical Map          30,000 STSs                       52,000 STSs                           October 1998
DNA Sequence          95% of gene-containing part of    99% of gene-containing part of        April 2003
                      human sequence finished to        human sequence finished to
                      99.99% accuracy                   99.99% accuracy
Capacity and Cost of  Sequence 500 Mb/year at           Sequence >1,400 Mb/year at            November 2002
Finished Sequence     <$0.25 per finished base          <$0.09 per finished base
Human Sequence        100,000 mapped human SNPs         3.7 million mapped human SNPs         February 2003
Variation
Gene Identification   Full-length human cDNAs           15,000 full-length human cDNAs        March 2003
Model Organisms       Complete genome sequences of      Finished genome sequences of          April 2003
                      E. coli, S. cerevisiae,           E. coli, S. cerevisiae, C. elegans,
                      C. elegans, D. melanogaster       D. melanogaster, plus whole-genome
                                                        drafts of several others, including
                                                        C. briggsae, D. pseudoobscura,
                                                        mouse and rat
Functional Analysis   Develop genomic-scale             High-throughput oligonucleotide       1994
                      technologies                      synthesis
                                                        DNA microarrays                       1996
                                                        Eukaryotic, whole-genome knockouts    1999
                                                        (yeast)
                                                        Scale-up of two-hybrid system for     2002
                                                        protein-protein interaction
11.3 YEAST GENOME DATABASE
Genetics and Molecular Biology of the Yeast Saccharomyces cerevisiae
The yeast Saccharomyces cerevisiae is widely regarded as an ideal eukaryotic microorganism
for biological studies. The awesome power of yeast genetics has become legendary and
is the envy of those who work with higher eukaryotes. The complete sequence of its
genome has proved to be extremely useful as a reference for the sequences of
human and other higher eukaryotic genes. Furthermore, the ease of genetic manipulation
of yeast allows its use for conveniently analyzing and functionally dissecting gene
products from other eukaryotes.
The Yeast Genome
S. cerevisiae contains a haploid set of 16 well-characterized chromosomes, ranging in
size from 200 to 2,200 kb. The total sequence of chromosomal DNA, constituting 12,052
kb, was released in April 1996. A total of 6,183 open reading frames (ORFs) longer than 100
amino acids were reported, and approximately 5,800 of them were predicted to
correspond to actual protein-coding genes. A larger number of ORFs is predicted when
shorter proteins are considered. In contrast to the genomes of multicellular organisms, the
yeast genome is highly compact, with genes representing 72% of the total sequence. The
average size of yeast genes is 1.45 kb, or 483 codons, with a range from 40 to 4,910
codons. A total of 3.8% of the ORFs contain introns. Approximately 30% of the genes
have already been characterized experimentally. Of the remaining 70% with unknown
function, approximately one half either contain a motif of a characterized class of
proteins or correspond to genes encoding proteins that are structurally related to
functionally characterized gene products from yeast or from other organisms.
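The figures quoted above can be checked against each other with a little arithmetic. The short Python sketch below is only a consistency check; the function name is hypothetical and every number is taken from the paragraph above.

```python
def check_yeast_genome_figures():
    """Cross-check the yeast genome statistics quoted in the text."""
    genome_kb = 12052        # total chromosomal sequence, in kb
    coding_fraction = 0.72   # share of the sequence occupied by genes
    mean_gene_kb = 1.45      # average gene size, in kb
    mean_codons = 483        # average gene size, in codons

    # Three nucleotides per codon: 483 codons should be close to 1.45 kb.
    mean_gene_bp = mean_codons * 3

    # Coding sequence divided by the average gene size gives a rough
    # estimate of how many genes the genome can hold.
    estimated_genes = genome_kb * coding_fraction / mean_gene_kb
    return mean_gene_bp, estimated_genes
```

The first value is 1,449 bp, which agrees with the stated 1.45 kb. The second is a little under 6,000 genes, which falls between the roughly 5,800 predicted protein-coding genes and the 6,183 reported ORFs, so the quoted statistics are mutually consistent.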
Genomic analysis
Many diverse studies require the determination of the abundance of large numbers of
specific DNA or RNA molecules in complex mixtures, including, for example, the
determination of the changes in mRNA levels of many genes. While a number of
techniques have been used to estimate the relative abundance of two or more sets of
mRNA, such as differential screening of cDNA libraries, subtractive hybridization, and
differential display, far superior methods have recently been developed that are
particularly amenable to organisms whose entire genome sequences are known, such as
S. cerevisiae. It is now practicable to investigate changes of mRNA levels of all yeast
ORFs in one experiment.
The following procedures have been successfully used for determining mRNA levels in
yeast: (i) the DNA Microarray System; (ii) the Oligonucleotide Microarray System; (iii)
the Low-density DNA Array System; and (iv) the kRT-PCR System.
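Table: The genome of a diploid cell of S. cerevisiae. A wild-type chromosomal gene is
depicted as YFG1+ (Your Favorite Gene) and the mutation as yfg1-1.

Chromosomes
Inheritance: Mendelian. Nucleic acid: double-stranded DNA. Location: nucleus.
Relative amount: 85%. Number of copies: 2 sets of 16. Size: 13,500 kb (200-2,200 kb).
Deficiencies in mutants: all kinds. Wild-type: YFG1+. Mutant or variant: yfg1-1.

2-µm plasmid
Inheritance: non-Mendelian. Nucleic acid: double-stranded DNA. Location: nucleus.
Relative amount: 5%. Number of copies: 60-100. Size: 6.318 kb.
Deficiencies in mutants: none. Wild-type: cir+. Mutant or variant: cir0.

Mitochondrial DNA
Inheritance: non-Mendelian. Nucleic acid: double-stranded DNA. Location: cytoplasm.
Relative amount: 10%. Number of copies: about 50 (8-130). Size: 70-76 kb.
Deficiencies in mutants: cytochromes. Wild-type: ρ+. Mutant or variant: ρ-.

RNA viruses (L-A, M, L-BC, T, W)
Inheritance: non-Mendelian. Nucleic acid: double-stranded RNA. Location: cytoplasm.
Relative amount: 80%, 10%, 9%, 0.5%, 0.5%. Number of copies: 10^3, 170, 150, 10, 10.
Size: 4.576, 1.8, 4.6, 2.7, 2.25 kb.
Deficiencies in mutants: killer toxin, or none. Wild-type: KIL-k1. Mutant or variant: KIL-0.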
Analyses with Yeast Systems
The accessibility of the yeast genome for genetic manipulation and the available
techniques to introduce exogenous DNA into yeast cells have led to the development of
methods for analyzing and preparing DNA and proteins not only from yeast itself, but
also from other organisms. For example, many mammalian homologs of yeast genes have
been cloned by using heterologous cDNA expression libraries in yeast expression vectors.
Also, yeast is being used to investigate the detailed functions of heterologous proteins,
such as mammalian transcription factors and nuclear hormone receptors. In fact, like
E. coli, yeast has become a standard microorganism for carrying out special tasks, some of
which are described in this section.
Yeast Artificial Chromosomes (YACs)
A yeast artificial chromosome (YAC) is a vector used to clone large DNA
fragments (larger than 100 kb and up to 3,000 kb). It is an artificially constructed
chromosome and contains the telomeric, centromeric, and replication origin sequences
needed for replication and preservation in yeast cells. YACs are built from an initial circular
plasmid, which is linearised using restriction enzymes; DNA ligase can then add a
sequence or gene of interest within the linear molecule by the use of cohesive ends. They
were first described in 1983 by Murray and Szostak.
YACs are extremely useful because yeasts are themselves eukaryotic cells, so eukaryotic
protein products can be obtained with post-translational modifications. However, YACs
have been found to be less stable than BACs, producing chimeric effects. Before the
advent of the Human Genome Project, YACs and BACs were used to map sections of
DNA of interest when hunting for specific genes.
Mention the goal of HGP.

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
What are all the various procedures used for determining mRNA levels in yeast?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
11.4 BACTERIAL ARTIFICIAL CHROMOSOME (BAC)
A bacterial artificial chromosome (BAC) is a DNA construct, based on a fertility
plasmid (or F-plasmid), used for transforming and cloning in bacteria, usually E. coli.
F-plasmids play a crucial role because they contain partition genes that promote the even
distribution of plasmids after bacterial cell division. The usual insert size of a bacterial
artificial chromosome is 150 kbp, with a range from 100 to 300 kbp. A similar cloning vector,
called a PAC, has also been produced from the DNA of bacteriophage P1.
BACs are often used to sequence the genomes of organisms in genome projects, for
example the Human Genome Project. A short piece of the organism's DNA is amplified
as an insert in BACs, and then sequenced. Finally, the sequenced parts are rearranged in
silico, resulting in the genomic sequence of the organism.
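The final in silico step can be pictured with a toy example. The Python sketch below merges sequenced fragments by repeatedly joining the pair with the longest suffix-to-prefix overlap. It is a deliberately simplified illustration under stated assumptions (error-free reads, a single strand, no repeats); the function names and the minimum overlap of 3 bases are hypothetical, and real genome assemblers are far more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads until no pair overlaps by at least min_len bases."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; return the remaining contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

For instance, the three fragments GCGTACG, ATGGCGT and TACGTTA overlap by four bases each and are joined into the single sequence ATGGCGTACGTTA.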
Common gene components in BACs
oriS, repE-F: for plasmid replication and regulation of copy number.
parA and parB: for partitioning F-plasmid DNA to daughter cells during division, ensuring
stable maintenance of the BAC.
A selectable marker: for antibiotic resistance; some BACs also have lacZ at the cloning
site for blue/white selection.
T7 and Sp6: phage promoters for transcription of inserted genes.
Contribution to models of disease
BACs are now being utilized to a greater extent in modeling genetic diseases, often
alongside transgenic mice. BACs have been useful in this field because complex genes may
have several regulatory sequences upstream of the encoding sequence, including various
promoter sequences that govern a gene's expression level. BACs have been used with
some degree of success in mice when studying neurological diseases such as
Alzheimer's disease, or the aneuploidy associated with Down syndrome.
There have also been instances in which they have been used to study specific oncogenes
associated with cancers. They are transferred into these genetic disease models by
electroporation/transformation, transfection with a suitable virus, or microinjection. BACs
can also be utilized to detect genes or large sequences of interest, which are then mapped
onto the human chromosome using BAC arrays. BACs are preferred for these kinds
of genetic studies because they accommodate much larger sequences without the risk of
rearrangement, and are therefore more stable than other types of cloning vectors.
11.5 OMIM - ONLINE MENDELIAN INHERITANCE IN MAN
The Mendelian Inheritance in Man project is a database that catalogues all the known
diseases with a genetic component and, when possible, links them to the relevant genes
in the human genome. It also provides references for further research and tools for genomic
analysis of a catalogued gene.
Versions
It is available as a book named after the project, currently in its 12th edition. The
online version is called Online Mendelian Inheritance in Man (OMIM), which can
be accessed with the Entrez database search engine of the National Library of Medicine and is
part of the NCBI Education project.
Collection Process
The information in this database is collected and processed under the leadership of
Dr. Victor A. McKusick at Johns Hopkins University, assisted by a team of science writers
and editors. Relevant articles are identified, discussed and written up in the appropriate
entries in the MIM database.
The MIM code
Every disease and gene is assigned a six-digit number, of which the first digit classifies
the method of inheritance.
If the initial digit is 1, the trait is deemed autosomal dominant; if 2, autosomal recessive;
if 3, X-linked. Where a trait has a MIM number, the number
from the 12th edition of MIM is given in square brackets, with or without an asterisk as
appropriate; e.g., Pelizaeus-Merzbacher disease [MIM*169500] is a well-established,
autosomal dominant, Mendelian disorder.
First Digit   Range of MIM Codes   Method of Inheritance
1             100000-199999        Autosomal dominant loci or phenotypes (created before May 15, 1994)
2             200000-299999        Autosomal recessive loci or phenotypes (created before May 15, 1994)
3             300000-399999        X-linked loci or phenotypes
4             400000-499999        Y-linked loci or phenotypes
5             500000-599999        Mitochondrial loci or phenotypes
6             600000-              Autosomal loci or phenotypes (created after May 15, 1994)
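The numbering scheme in the table above amounts to a simple range lookup, sketched below in Python. The function name is hypothetical, and the upper bound of 999999 for the open-ended last range is an assumption that follows only from MIM numbers having six digits.

```python
# Ranges copied from the table above. The upper limit of the last range
# is an assumption (the largest possible six-digit number).
MIM_RANGES = [
    (100000, 199999, "Autosomal dominant loci or phenotypes (created before May 15, 1994)"),
    (200000, 299999, "Autosomal recessive loci or phenotypes (created before May 15, 1994)"),
    (300000, 399999, "X-linked loci or phenotypes"),
    (400000, 499999, "Y-linked loci or phenotypes"),
    (500000, 599999, "Mitochondrial loci or phenotypes"),
    (600000, 999999, "Autosomal loci or phenotypes (created after May 15, 1994)"),
]


def inheritance_from_mim(number):
    """Return the inheritance class implied by a six-digit MIM number."""
    if not 100000 <= number <= 999999:
        raise ValueError("a MIM number must have six digits")
    for low, high, meaning in MIM_RANGES:
        if low <= number <= high:
            return meaning
```

For the example quoted earlier, inheritance_from_mim(169500) returns the autosomal dominant class, because the number begins with 1.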
Symbols:
Meaning of the symbols that may precede a MIM number:
asterisk (*), number sign (#), plus (+), percent (%), caret (^).
An asterisk (*) before an entry number indicates a gene of known sequence.
A number symbol (#) before an entry number indicates that it is a descriptive entry,
usually of a phenotype.
A plus sign (+) before an entry number indicates that the entry contains the description of
a gene of known sequence and a phenotype.
A percent sign (%) before an entry number indicates that the entry describes a confirmed
Mendelian phenotype or phenotypic locus for which the underlying molecular basis is not
known.
No symbol before an entry number generally indicates a description of a phenotype for
which the Mendelian basis, although suspected, has not been clearly established, or that
the separateness of this phenotype from that in another entry is unclear.
A caret symbol (^) before an entry number means the entry no longer exists because it
was removed from the database or moved to another entry as indicated.
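The symbol conventions above can be captured in a small lookup, sketched here in Python. The function name and the shortened wording of each meaning are illustrative assumptions; only the symbols themselves and their interpretations come from the text.

```python
# Short paraphrases of the meanings listed above.
MIM_SYMBOLS = {
    "*": "gene of known sequence",
    "#": "descriptive entry, usually of a phenotype",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed Mendelian phenotype, molecular basis unknown",
    "^": "entry removed or moved to another entry",
}
NO_SYMBOL = "phenotype whose Mendelian basis is suspected but not established"


def parse_mim_entry(entry):
    """Split a string such as '*169500' into (number, meaning of its prefix)."""
    entry = entry.strip()
    symbol = entry[0] if entry and entry[0] in MIM_SYMBOLS else ""
    digits = entry[len(symbol):]
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError("expected an optional symbol followed by six digits")
    return int(digits), MIM_SYMBOLS.get(symbol, NO_SYMBOL)
```

Calling parse_mim_entry("*169500") gives the number 169500 together with the meaning "gene of known sequence".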
KEGG PATHWAY Database
The KEGG PATHWAY database records networks of molecular interactions in cells,
and variants of them specific to particular organisms.
What are BACs?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
11.6 KEGG DATABASE
KEGG, the Kyoto Encyclopedia of Genes and Genomes, was initiated by the Japanese
human genome programme in 1995. Its developers consider KEGG to be a computer
representation of the biological system. The KEGG database can be used for modeling
and simulation, and for browsing and retrieval of data. It is a part of the systems
biology approach.
KEGG maintains four main databases.
* KEGG Pathway
* KEGG Genes
* KEGG Ligand
* KEGG BRITE
Databases
KEGG connects known information on molecular interaction networks, such as pathways
and complexes (the Pathway database), information about genes and proteins
generated by genome projects (including the Genes database), and information about
biochemical compounds and reactions (including the compound and reaction databases).
These databases represent different networks, known as the protein network, the gene universe
and the chemical universe respectively. Efforts are in progress to add to the
knowledge in KEGG, including information regarding ortholog clusters in the KO
(KEGG Orthology) database.
KEGG Pathways:
* Metabolism
* Genetic Information Processing
* Environmental Information Processing
* Cellular Processes
* Human Diseases
* Drug Development
Ligand Database:
* Compound
* Drug
* Glycan
* Reaction
* RPAIR
* Enzyme
11.7 THE INSTITUTE FOR GENOMIC RESEARCH (TIGR)
TIGR's Genome Projects are a collection of curated databases containing DNA and
protein sequence, gene expression, cellular role, protein family, and taxonomic data for
microbes, plants and humans. Access to the data is facilitated by TIGR's Internet2
high-speed research network connection, which is supported in part by the National
Science Foundation under grant ANI-0333537. Anonymous FTP access to sequence data
is also provided. The following resources can also be accessed.
Comprehensive Microbial Resource
The Comprehensive Microbial Resource (CMR) is a free website used to display
information on all of the publicly available, complete prokaryotic genomes. In addition to
the convenience of having all of the organisms on a single website, common data types
across all genomes in the CMR make searches more meaningful, and cross-genome analysis
highlights differences and similarities between the genomes. A CMR mirror site
maintained by the Genome Encyclopedia of Microbes (GEM) in Korea is also available.
Plant Genomics
The TIGR Castor Bean Database provides links to the castor bean genome project at
TIGR and includes sequencing and assembly of a 4X draft of the ~400 Mbp genome
using a whole genome shotgun strategy, and ~50,000 ESTs from different tissues to aid in
gene discovery and annotation. This project is funded by the NIAID-NIH, through the
Microbial Genome Sequencing Center at TIGR.
TIGR Plant Transcript Assemblies represent clustered assemblies of all transcripts for
~140 plant species.
The TIGR-NCSU Phytophthora infestans Mitochondrial Genome Haplotyping
Database is sponsored by the USDA.
The Comprehensive Phytopathogen Genome Resource provides a centralized resource
for accessing genomic data for plant pathogens, including viral, bacterial, fungal,
oomycete and nematode pathogens.
The TIGR Wheat Genome Database provides access to wheat genomic and EST
sequences along with other bioinformatics analyses such as alignments to the rice
genome.
The TIGR Arabidopsis thaliana Database provides access to genomic sequence data
and annotation generated at TIGR and assemblies of Arabidopsis ESTs from world-wide
sequencing projects.
The TIGR Rice Database provides links to the USDA-CSREES/NSF/DOE-funded
rice genome project at TIGR and includes sequence data, annotation, and links to the
Oryza sativa Gene Index.
The Potato Functional Genomics Project provides links to the NSF-funded potato genome
project at TIGR and includes sequence data, annotation, and links to the Solanum
tuberosum Gene Index.
The TIGR Maize Database provides links to the NSF-funded Consortium for Maize
Genomics project and includes sequence, assembly and annotation data and links to the
Maize Gene Index.
TIGR Plant Repeat Databases are a collection of repetitive sequences for 12 plant genera
and four plant families.
The TIGR Loblolly Pine Functional Genomics Project is run in collaboration with the
Institute of Paper Science and Technology and funded by the National Science
Foundation.
The TIGR Medicago truncatula Database provides access to annotations generated at
TIGR and Medicago EST and BAC sequences from world-wide sequencing projects.
The Maize Oligonucleotide Array Project produces and distributes to the research community
high-density microarrays for the maize genome. The project site will contain information on
project goals, participants, array availability, and data access.
The Rice Oligonucleotide Array Project produces and distributes to the research community a
whole-genome microarray for the rice genome and links this information to the rice
genome.
The Arabidopsis Array Project provides experimental validation for Arabidopsis thaliana
gene predictions and begins to assign functional roles to genes using DNA microarray
technology as a key tool.
What is TIGR?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Parasite Projects
The TIGR Perkinsus marinus genome database.
The TIGR Theileria parva genome database, sponsored by TIGR and the International
Livestock Research Institute in Nairobi, Kenya.
The TIGR Parasites Database provides links to TIGR sequencing projects completed
and underway, as well as links to related world-wide sequencing efforts, for the following organisms:
Babesia bovis
Brugia malayi
Entamoeba histolytica
Plasmodium falciparum
Plasmodium vivax
Plasmodium yoelii
Schistosoma mansoni
Toxoplasma gondii
Trichomonas vaginalis
Trypanosoma brucei
Trypanosoma cruzi
Other Eukaryotic Projects
The TIGR Tetrahymena thermophila genome database provides links to the NSF- and
NIH-NIGMS-funded Tetrahymena genome project at TIGR.
The TIGR Vector Genomics Database provides links to TIGR's sequencing efforts in
the area of Aedes aegypti vector genomics.
The TIGR-NCSU Phytophthora infestans Mitochondrial Genome Haplotyping
Database is sponsored by the USDA.
The TIGR-USDA Bracovirus Comparative Genomics Database is sponsored by the
NSF and USDA.
11.8 MBGD
Microbial Genome Database for Comparative Analysis
MBGD is a workbench system for comparative analysis of completely sequenced
microbial genomes. The central function of MBGD is to create an orthologous or homologous
gene cluster table. For this purpose, similarities between all genes are precomputed and
stored in the database, in addition to gene annotations such as the function
categories assigned by the original authors and motifs found in the
translated sequences. Using these homology data, MBGD dynamically creates an orthologous
gene cluster table. Users can change the set of organisms or the cutoff parameters to create
their own orthologous grouping. Based on this cluster table, users can further analyze
multiple genomes from various points of view with functions such as global map
comparison, local map comparison, multiple sequence alignment and phylogenetic tree
construction.
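The idea of building clusters from precomputed similarities and a user-chosen cutoff can be illustrated with a toy example. The Python sketch below uses plain single-linkage grouping: any two genes whose similarity score reaches the cutoff end up in the same cluster. This is an assumption made for illustration only and is not MBGD's own clustering procedure; the function name, the gene labels and the scores are all hypothetical.

```python
def cluster_genes(similarities, cutoff):
    """Group genes by single linkage over precomputed pairwise scores.

    similarities maps a (gene, gene) pair to a similarity score.
    Returns a sorted list of clusters, each a sorted list of gene names.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for (g1, g2), score in similarities.items():
        root1, root2 = find(g1), find(g2)  # registers both genes
        if score >= cutoff and root1 != root2:
            parent[root2] = root1

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores between genes of three organisms (eco, bsu, hin).
example = {
    ("ecoA", "bsuA"): 250,
    ("bsuA", "hinA"): 180,
    ("ecoB", "bsuB"): 90,
    ("ecoA", "ecoB"): 40,
}
```

With a cutoff of 100 the example yields one three-gene cluster and two singletons; lowering the cutoff to 80 also joins ecoB and bsuB, which mirrors how changing the cutoff parameters changes the resulting grouping.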
More than 300 genomic sequences have been determined to date, and the number of
completed sequences continues to grow. Extracting useful information from such a
growing number of genomes is a major challenge in comparative genomics. Interestingly,
many of the completed genomic sequences are closely related to each other: of the 293
genomic sequences available at the end of 2005, the number of unique species (for which
at least one genome sequence was determined) is 211, and the number of unique genera is
only 135. It is important to conduct comparative analyses not only of distantly related
genomes, but also of closely related genomes, since different types of
information about biological functions and evolutionary processes can be extracted from
comparisons of genomes at different evolutionary distances.
What is MBGD?
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
MBGD is a microbial genome database that provides a platform for large-scale
comparative genome analysis based on comprehensive ortholog classification. Unlike
COG, TIGRFAMs and other databases of orthologous groups constructed with curation
processes, MBGD is comprehensive and routinely updated. Unlike OrthoMCL-DB, IMG
and other databases of orthologous groups constructed by automated procedures, MBGD
allows users to classify genes dynamically. The key features of MBGD derive from an
efficient clustering algorithm named DomClust, which is a hierarchical clustering
algorithm for constructing ortholog groups at the domain (rather than gene) level from
precomputed all-against-all similarity relationships. With this algorithm, MBGD not only
provides the orthologous groups among the latest genomic data available, but also allows
users to create their own ortholog groups using a specified set of organisms. The latter
feature is especially useful when the user's interest is focused on some taxonomically
related organisms; in fact, MBGD is most effectively used when an appropriate number
of genomes are selected. However, in the previous version, users could only choose
published genomes whose sequences were already available in MBGD.
With the growing amount of microbial genomic information in mind, the MBGD developers
started a new service called My MBGD, which allows users to add their own genome
sequences to MBGD in order to find orthologous relationships among the
newly added genomes and the existing genomes. Furthermore, to facilitate
comparisons of closely related genomes, the pairwise comparison interface was
enhanced using the CGAT interface, which is a Java applet for
displaying genome and alignment viewers.
[Figure: MBGD Function]
11.9 SUMMARY
The HGP was a 13-year effort whose goals included identifying the approximately
20,000-25,000 genes in human DNA.
The yeast Saccharomyces cerevisiae is an ideal eukaryotic
microorganism for biological studies.
BACs are now being utilized to a greater extent in modeling genetic diseases,
often alongside transgenic mice.
OMIM, KEGG and TIGR are specialised genome databases.
11.10 KEY WORDS
Bacterial Artificial Chromosome (BAC):
Cloning vector that can incorporate large fragments of DNA.
Base pair:
A pair of nitrogenous bases (a purine and a pyrimidine), held together by
hydrogen bonds, that forms the core of DNA and RNA, i.e., the A:T, G:C and A:U
interactions.
Expressed Sequence Tag (EST):
A partial sequence of a clone, randomly selected from a cDNA library and used to
identify genes expressed in a particular tissue. ESTs are used extensively in projects to
map the human genome.
Flat file:
A human-readable data file in a convenient form for interchange of database
information. Flat files may be created as output from relational databases, in a format
suitable for loading into other databases.
Genome:
All the genetic material in the chromosomes of a particular organism. Its size is
generally given as its total number of base pairs.
Genome Projects:
Initiatives to map and sequence the entire genomes of particular organisms. The first
complete eukaryotic genome to have been sequenced is that of the yeast S. cerevisiae.
Kilobase (kb):
Unit of length of DNA fragments equal to 1,000 nucleotides.
Megabase (Mb):
Unit of length of DNA fragments equal to 1 million nucleotides.
Microarrays:
An array is an arrangement of points in rows and columns. A microarray is an
extension of the concept and constitutes a very small arrangement of many points in rows
and columns. The term refers to a series of high-density DNA spots bound to some solid
support.
MMDB:
Molecular Modeling Database. A taxonomy-assigned database of PDB files and
related information.
NCBI:
The US National Center for Biotechnology Information.
NIH:
The US National Institutes of Health.
11.11 ANSWERS TO CHECK YOUR PROGRESS
1. Project goals were to
Identify all the approximately 20,000-25,000 genes in human DNA,
Determine the sequences of the 3 billion chemical base pairs that make up human
DNA,
Store this information in databases,
Improve tools for data analysis,
Transfer related technologies to the private sector, and
Address the ethical, legal, and social issues (ELSI) that may arise from the
project.
2. The following procedures have been successfully used for determining mRNA levels in
yeast: (i) the DNA Microarray System; (ii) the Oligonucleotide Microarray System;
(iii) the Low-density DNA Array System; and (iv) the kRT-PCR System.
3. A bacterial artificial chromosome (BAC) is a DNA construct, based on a fertility
plasmid (or F-plasmid), used for transforming and cloning in bacteria, usually E. coli.
4. TIGR's Genome Projects are a collection of curated databases containing DNA and
protein sequence, gene expression, cellular role, protein family, and taxonomic data for
microbes, plants and humans.
5. MBGD is a workbench system for comparative analysis of completely sequenced
microbial genomes. The central function of MBGD is to create an orthologous or
homologous gene cluster table.
11.12 SELF ASSESSMENT QUESTIONS
1. What is the significance of the KEGG database?
2. Explain OMIM.
11.13 FURTHER READINGS
1. Introduction to Bioinformatics - Attwood.
2. Bioinformatics: Sequence and Genome Analysis - David W. Mount.
3. Bioinformatics: Methods and Applications - S.C. Rastogi.
4. Bioinformatics - C.S.V. Murthy.
[Figure labels, MBGD Function: homology relationships; organism selection; parameter setting; ortholog cluster table]


Another automatically derived tertiary resource, derived from BLOCKS and

It is a database of alignments of protein domain families. This database is based

9.8 STRUCTURE CLASSIFICATION DATABASES

The most important structural classification databases are.

1. SCOP (Structural classification of proteins)

SCOP has been constructed using a combination of manual inspection and

CATH is maintained at UCC. The resources is largely derived using automatic

9.9 SPECIALISED DATABASES

2. KEGG (Kyoto Encyclopedia of Genes & Genomes)

It is an effort to computerize current knowledge of molecular & cellular biology

3. SGD (Sacharromyces Genome Database)

SGD is an resource which brings together the information on the molecular

UniGene provide a transcript map by utilizing sets of non-redundant gene-

5. TDB (TIGR DATA BASE)

Check Your Progress 5:

9.11 KEY WORDS:

2. Briefly explain various Structural Classification Database.