
Bioinformatics

Introduction, Scope and Methods

Supervised by
Dr. Yogita, Assistant Professor
National Institute of
Technology, Meghalaya

Presented by
Subhasree Majumder
T18CS005
M.Tech CSE 1st Year
Objective

• Introduction to Bioinformatics
• Application Areas
• Fundamentals of Cell, DNA and Proteins
• Central Dogma of Biology
• Biological databases, data formats
• Sequence Alignment
What is Bioinformatics
• Bioinformatics is the use of computational power for the storage, retrieval, manipulation and distribution of information related to biological macromolecules such as DNA, RNA and proteins.
• In a broader sense it is an interdisciplinary field at the meeting point of Biology, Statistics and Computer Science.
Applications and Subfields
Bioinformatics consists of two major subfields: the development of computational tools and databases, and the application of these tools and databases to generate biological knowledge in the areas of molecular sequence analysis, molecular structural analysis, and molecular functional analysis.
Biology for Engineers
Cell- The Fundamental Unit of Life.

• The Cytoplasm is the main matrix of the cell, which holds all the cellular devices, or organelles.

• The Nucleus holds the genetic material called DNA.

• Ribosomes are special organelles aiding in protein synthesis.

• The Endoplasmic Reticulum is a membrane network connecting the nucleus to the cytoplasm; it helps to synthesize and transport proteins and lipids.

• Golgi Apparatus helps in packaging proteins.


Nature’s Most Common Macromolecular
Sequences: DNA, RNA and Proteins
Central Dogma of Molecular Biology
The cell contains the nucleus, which houses the DNA, but how does this information get expressed as specific traits in individuals?

Theory of Central Dogma of Biology

• 4 bases in DNA: Adenine, Thymine, Cytosine and Guanine (in RNA, Uracil replaces Thymine).
• There are 20 amino acids. Two bases give only 4*4 = 16 combinations, so at least 3 bases are needed: 4*4*4 = 64 combinations, enough to code for the 20 amino acids.
• Each such group of 3 bases is called a codon.
• A region of codons coding for a protein is called a gene.
Codon Amino Acid Chart

• UAA, UAG and UGA do not code for any amino acid, so they are the stop codons.
• AUG is the start codon and it codes for Methionine.
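The codon chart can also be expressed as a lookup table. The following is a minimal sketch of the standard genetic code in RNA letters; the names `CODON_TABLE`, `STOP_CODONS` and `translate` are illustrative and not part of the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


print(len(CODON_TABLE))      # number of distinct codons
print(sorted(STOP_CODONS))   # the stop codons
print(CODON_TABLE["AUG"])    # amino acid for the start codon
```

The table has 64 entries covering 20 amino acids plus the stop signal, which is why several codons map to the same amino acid.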
Exemplary Bioinformatics
Methods
Open Reading Frame Search
1. Consider a hypothetical sequence:
CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT

2. Divide the sequence into 6 different reading frames (+1, +2, +3, -1, -2 and -3). The first reading frame is obtained by grouping the sequence into words of 3 nucleotides.
FRAME +1: CGC TAC GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT CGG TAG GGC TCG ATC ACA TCG CTA GCC AT

3. The second reading frame is formed after leaving the first nucleotide and then grouping the sequence into words of 3
nucleotides
FRAME +2: C GCT ACG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTC GGT AGG GCT CGA TCA CAT CGC TAG CCA T

4. The third reading frame is formed after leaving the first 2 nucleotides and then grouping the sequence into words of 3 nucleotides
FRAME +3: CG CTA CGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TCG GTA GGG CTC GAT CAC ATC GCT AGC CAT

5. The other 3 reading frames can be found only after finding the reverse complement.
Complement : GCGATGCAGAATGCGACCTCGAGAGTACCTAGCCAAGCCATCCCGAGCTAGTGTAGCGATCGGTA
Reverse complement: ATGGCTAGCGATGTGATCGAGCCCTACCGAACCGATCCATGAGAGCTCCAGCGTAAGACGTAGCG

6. Now the same process used for frames +1, +2 and +3 is repeated for frames -1, -2 and -3 on the reverse complement sequence
FRAME -1: ATG GCT AGC GAT GTG ATC GAG CCC TAC CGA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACG TAG CG
FRAME -2: A TGG CTA GCG ATG TGA TCG AGC CCT ACC GAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CGT AGC G
FRAME -3: AT GGC TAG CGA TGT GAT CGA GCC CTA CCG AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC GTA GCG
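The six-frame split above can be sketched in a few lines of Python. The helper names (`reverse_complement`, `codons`, `six_frames`) are hypothetical, and this sketch drops the leading and trailing partial words instead of printing them.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete words of 3 nucleotides, skipping `offset` bases."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return the codon lists for frames +1, +2, +3 and -1, -2, -3."""
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = codons(seq, k)
        frames[f"-{k + 1}"] = codons(rc, k)
    return frames


seq = "CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT"
for name, frame in six_frames(seq).items():
    print(name, " ".join(frame))
```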
Open Reading Frame Search
7. Now mark the start codon and the stop codons in the reading frames.

8. Identify the open reading frame (ORF): a sequence stretch beginning with a start codon and ending in a stop codon.

9. Based on the amino acid table, the peptide sequence is found.
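A minimal sketch of the ORF search within a single reading frame follows. The function name `find_orfs` and its `offset` parameter are assumptions made for illustration; it works on DNA letters, so the start codon is ATG and the stop codons are TAA, TAG and TGA.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, offset=0):
    """Return every stretch in one reading frame that begins with a start
    codon and ends with the next in-frame stop codon (stop included)."""
    words = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    orfs = []
    start = None
    for idx, codon in enumerate(words):
        if start is None and codon == START_CODON:
            start = idx                      # open a candidate ORF
        elif start is not None and codon in STOP_CODONS:
            orfs.append("".join(words[start:idx + 1]))
            start = None                     # close it and keep scanning
    return orfs


seq = "CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT"
print(find_orfs(seq, offset=1))   # frame +2 of the example sequence
```

Running the same function on the reverse complement with offsets 0, 1 and 2 covers frames -1, -2 and -3.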


Biological Databases and File Format

1. Nucleotide sequence data is stored in 3 databases:
EMBL - European Molecular Biology Laboratory
DDBJ - DNA Data Bank of Japan
GenBank - National Center for Biotechnology Information (NCBI)

2. Protein sequence data is stored in:
UniProt - Universal Protein Resource

3. Articles and biological literature database:
PubMed

4. Database retrieval systems:
SRS and Entrez.
Biological Databases and File Format
1. FASTA File Format:
>margaret1.2
GGATCGAACTACGTTCACATTACGTCACATTG
2. Genbank File Format:
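As an illustration of the FASTA format from item 1 (a header line starting with ">" followed by one or more sequence lines), here is a minimal parsing sketch. The function name `parse_fasta` is hypothetical; real projects would more likely use an established parser.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # header line without the ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = ">margaret1.2\nGGATCGAACTACGTTCACATTACGTCACATTG\n"
print(parse_fasta(example))
```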
Sequence Alignment: Introduction
Why sequence alignment?
• Life started from a single cell, and through the process of evolution it led to the formation of different species and life forms. If we take any two species we may find some similarity between them; similarity that is due to common ancestry is called homology.
• In genetics, we can find homology in genetic sequences such as DNA or protein sequences. Homologous
groups can be divided into two groups:
• Similarities in structure indicate common functions. For example, the function of a protein is dictated by its structure. When we say sequences are homologous, we mean they diverged from the same ancestor. Microbial genomic sequencing projects have shown that homologous proteins have similar structures.
• So given two sequences, we try to find the best alignment between them to show that one might have resulted from the other through mutation, and hence conclude that they came from the same ancestor. For example, given the two sequences ATCGA and ATGA, this could be a possible alignment of the two:
ATCGA
AT_GA
What is the usefulness of sequence alignment?
• Suppose we know the gene causing obesity in mice. Can we find a similar gene in humans? Since mice and humans came from the same ancestor, it might be the case that we too have a similar gene, only mutated over time.
• Predicting protein structures of newly sequenced proteins.
• Calculating similarity between two species, for example chimpanzees and humans.
How to decide the best alignment which might have been caused by nature?
Gaps/Indels
Here gaps are called indels, i.e. insertions/deletions in a sequence during evolution, which have led to many functional changes.
So indels get a negative score in an alignment, unlike substitutions.

AACGT_AGAATC__
_TGCAGAGCC_GGA
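Gap penalty can be calculated as follows, for a gap of length n:
• Opening a gap receives a penalty of d.
• Extending a gap receives a penalty of e.
• Total penalty = d + (n-1)*e.

• Ex. gap opening penalty d = 10, gap extension penalty e = 0.5

• Then, for H Q H G A aligned against a gap of length n = 5:
H Q H G A
_ _ _ _ _

Total penalty = 10 + (5-1)*(0.5) = 12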
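The gap penalty formula is simple enough to check in code. This is a small sketch; the function name `affine_gap_penalty` is illustrative, and treating a gap of length 0 as costing nothing is an assumption.

```python
def affine_gap_penalty(n, d, e):
    """Penalty for one gap of length n: d for opening it,
    plus e for each of the remaining n - 1 extension positions."""
    if n <= 0:
        return 0  # assumption: no gap, no penalty
    return d + (n - 1) * e


# The example above: gap of length 5, d = 10, e = 0.5
print(affine_gap_penalty(5, 10, 0.5))  # 12.0
```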


Scoring Matrices
• The simplest way of scoring is to assign 1 for a match and 0 for a mismatch. But unitary matrices aren't adequate because some substitutions are more prevalent in nature than others.
• For example, in an amino acid sequence, a change from valine to isoleucine is more likely than a change from valine to aspartic acid, due to the difference in their chemical properties.
• So substitutions that are prevalent in nature are given reward points, while those that are not are given penalty points in sequence alignment.
• PAM (Point Accepted Mutation) matrices are derived using substitution rates from real protein sequence alignments.
Alignment algorithms

Dynamic programming algorithm for pairwise sequence alignment ~ Needleman-Wunsch global sequence alignment algorithm

Dynamic programming works by finding the best alignment as the sum of the best previous alignment and the score of the present step. So suppose we want to align two sequences x and y.
• F(i,j) is the score of the best alignment between x1…xi and y1…yj.
• S(a,b) is the score of substituting a with b, and d is the gap penalty.
• 3 cases are possible for the last column of the alignment: xi aligned with yj, xi aligned with a gap, or yj aligned with a gap.

xi   xi   __
yj   __   yj

• F(i,j) = max { F(i-1,j-1) + S(xi,yj),
                 F(i-1,j) + d,
                 F(i,j-1) + d }

Input ~ two sequences, Seq1 and Seq2

Parameters ~ scoring function for substitution of residues, and gap score.

Output ~ optimal alignment of Seq1 and Seq2, i.e. the one with the maximal score.


Sequences are AAG and AGC.
Linear gap penalty: d = -5.
Given a nucleotide substitution matrix S:
F(0,1) = F(0,0) + d = 0 + (-5) = -5
….
F(0,3) = F(0,2) + d = -10 + (-5) = -15

The first row and the first column of F are therefore 0, -5, -10, -15.

F(1,1) = max(F(0,0) + S(A,A), F(0,1) + d, F(1,0) + d) = max(2, -10, -10) = 2
We now trace back to the upper left.

Final alignment:

AAG-
-AGC
Or
AAG-
A-GC

Note: Both alignments come from a tie during traceback. When we take the gap (-5) first, we get a gap for the first A and then the match A A. When we take the match (2) first, we get A A and for the next A we take a gap.
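The worked example can be reproduced with a short Needleman-Wunsch sketch. The substitution matrix from the slide is not reproduced in the text, so the `subst` function below is an assumed stand-in (match 2, transition -5, transversion -7), chosen only so that S(A,A) = 2 as in the example. The function names are illustrative, and ties in the traceback are broken in favour of the diagonal, so only one of the optimal alignments is returned.

```python
PURINES = {"A", "G"}


def subst(a, b):
    """Assumed substitution scores: match 2, transition -5, transversion -7."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -5 if same_class else -7


def needleman_wunsch(x, y, score, gap):
    """Global alignment with a linear gap penalty (gap is negative)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right cell to the upper-left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, top, bottom = needleman_wunsch("AAG", "AGC", subst, -5)
print(F[-1][-1])
print(top)
print(bottom)
```

Under these assumed scores the sketch yields the alignment AAG- over -AGC; a different tie-breaking order in the traceback would return AAG- over A-GC instead.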
Local vs Global Sequence Alignment
• In global alignment, two sequences to be aligned are assumed to be generally similar over their entire length.
• Alignment is carried out from beginning to end of both sequences to find the best possible alignment across the entire length between the two sequences.
• This method is more applicable for aligning two closely related sequences of roughly the same length.
• For divergent sequences and sequences of variable lengths, this method may not be able to generate optimal results because it fails to recognize highly similar local regions between the two sequences.

• Local alignment, on the other hand, does not assume that the two sequences in question have similarity over the entire length.
• It only finds local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequence regions.
• This approach is more appropriate for aligning divergent biological sequences containing only modules that are similar, which are referred to as domains or motifs.
Local Alignment algorithms

Dynamic programming algorithm for pairwise sequence alignment ~ Smith-Waterman local sequence alignment algorithm

It was later discovered that proteins may have similar functions but very different structures, with similarities found only in domains/motifs. A global alignment algorithm often ignores such similar areas.

Smith and Waterman modified the Needleman-Wunsch algorithm by bounding the score function below by 0 instead of letting it go negative:

F(i,j) = max { 0, F(i-1,j-1) + S(xi,yj), F(i-1,j) + d, F(i,j-1) + d }
Note two things here: since the lower bound is 0, there are no negative scores, and the traceback path no longer always starts from the bottom-right cell.

As a result, we also obtain local alignments, for example 0->2 and 0->2->4.

Trace back from the largest value until a 0 is reached:
AG
AG

Also trace back the second-best alignment:
A
A
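A minimal Smith-Waterman sketch follows, applied to the sequences AAG and AGC with a linear gap penalty of -5. The `subst` scores (match 2, transition -5, transversion -7) are an assumption, since the slide's substitution matrix is not reproduced in the text, and the function names are illustrative. Only the single best local alignment is traced back.

```python
PURINES = {"A", "G"}


def subst(a, b):
    """Assumed substitution scores: match 2, transition -5, transversion -7."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -5 if same_class else -7


def smith_waterman(x, y, score, gap):
    """Local alignment: scores are bounded below by 0 and the traceback
    starts at the largest cell instead of the bottom-right cell."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the largest value until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


best, top, bottom = smith_waterman("AAG", "AGC", subst, -5)
print(best, top, bottom)
```

With these assumed scores the best local alignment is AG against AG with score 4, matching the 0->2->4 path described above.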
Thank You 
