Lva1 App6891 PDF

GENE PREDICTION
presented by
Rituparna Addy
Department of Biotechnology
Haldia Institute of Technology
Gene:
• A sequence of nucleotides coding for protein.
Central Dogma:
• Proposed in 1958 by Francis Crick.
• He postulated that not all conceivable transfers of
information between DNA, RNA and protein are possible.
• He restated the idea in a paper published in 1970.
CODONS:
• Discovered by Sydney Brenner and Francis Crick in
1961.
• A codon is a triplet of nucleotides; each codon codes for
one amino acid in a protein (or signals a stop).
• With four bases there are 4^3 = 64 possible codons.
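The codon count can be checked by enumeration. The snippet below is a minimal sketch; the names `BASES`, `STOP_CODONS`, `codons` and `sense_codons` are illustrative, and the three stop codons listed are those of the standard genetic code, which the slides do not spell out.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triplet of the four bases: 4 * 4 * 4 combinations.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Assumption: standard genetic code, DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons), "codons,", len(sense_codons), "of them code for amino acids")
```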
DNA --(1)--> RNA --(2)--> PROTEIN --(3)--> PHENOTYPE
RNA --(4)--> cDNA
1. TRANSCRIPTION
2. TRANSLATION
3. GENE EXPRESSION
4. REVERSE TRANSCRIPTION
DEFINITION
• Gene prediction is a prerequisite for detailed functional
annotation of genes and genomes.
• It detects the location of ORFs (Open Reading
Frames) and the structures of introns and exons.
• Its ultimate goal is to describe all the genes
computationally with near 100% accuracy.
• It can reduce the amount of experimental
verification work required.
TYPES
• Ab initio-based: uses gene signals (intron splice sites,
transcription factor binding sites, ribosomal binding sites,
polyadenylation sites), triplet codon structure and gene
content.
• Homology-based: uses significant matches of the query
sequence with sequences of known genes.
• Probabilistic models such as Markov models or Hidden
Markov Models (HMMs) are used.
Gene finding software/program
• It is organism-specific.
• It works best on genes that are reasonably similar to
a known gene detected previously.
• It finds protein coding regions far better than non-
coding regions.
• It can predict the most probable exons and
suboptimal exons.
• It is reasonably successful in finding genes in a
genome. But still it is imperfect!
Exons and Introns
• In eukaryotes, the gene is a combination of coding
segments (exons) that are interrupted by non-coding
segments (introns).
• Genes in prokaryotes are continuous, so
computational gene prediction is much easier than in
eukaryotes.
• Exons are interspersed with introns; introns typically
begin with GT and end with AG.
• Exon types: initial, internal, final and single.
Sequence signals
[Figure: Genomic DNA -> (Transcription) -> pre-mRNA, Cap-...-Poly(A) -> (Splicing) -> mRNA, Cap-...-Poly(A) -> (Translation) -> Protein. Labels: start codon, stop codon, exon, intron, GT (donor site), AG (acceptor site), splice sites.]
• Exons are usually shorter than introns.
Prokaryotic gene prediction
• Gene prediction is easier in microbial genomes.
• Reasons: smaller genomes, high gene density, very few
repetitive sequences, more sequenced genomes.
• The start codon is usually ATG.
• A ribosomal binding site (Shine-Dalgarno sequence)
precedes the start codon.
Open reading frames
• A sequence defined by in-frame start and stop
codon, which in turn defines a putative amino acid
sequence.
• In any one reading frame, a genome of length n comprises about n/3 codons.
• Stop codons break genome into segments between
consecutive stop codons.
• The sub-segments of these that start from the Start
codon (ATG) are ORFs.
• DNA is translated in all six possible frames, three
frames forward and three reverse.
[Figure: an open reading frame within a genomic sequence, running from ATG to TGA. The two strands of the example sequence:]
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
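The ORF definition above (in-frame ATG followed by an in-frame stop codon, searched in three forward and three reverse frames) can be sketched as follows. This is a minimal illustration, not the method of any particular program: the function names, the `min_codons` parameter and the choice of TAA/TAG/TGA as stop codons (standard genetic code) are assumptions.

```python
# Assumption: standard genetic code stop codons, DNA alphabet.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) for each ATG...stop ORF in one frame.

    `end` is exclusive and includes the stop codon. The first ATG after
    the previous stop is used as the start.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=1):
    """Scan three forward and three reverse frames for ORFs.

    Coordinates on the "-" strand refer to the reverse-complemented string.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end, s[start:end]))
    return found


for orf in six_frame_orfs("CCATGAAATGACC"):
    print(orf)
```

Real gene finders add a minimum length cut-off (here `min_codons`) because short ORFs occur by chance in every frame.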
Probabilistic models
• Statistical description of a gene.
• Markov Models & Hidden Markov Models.
• Used to distinguish oligonucleotide distributions in
the coding regions from those for non-coding
regions.
• The probability of a nucleotide at a given position
depends on the k preceding nucleotides, where k is the order of the model.
• Common orders are zero, first and second.
• The higher the order, the more accurately a gene can be predicted.

ORDER    DESCRIPTION
Zero     Each base occurs independently with a given probability
First    Occurrence of a base depends on the base preceding it
Second   The preceding two bases determine which base follows
(Zero to second order: non-coding sequences ---> coding sequence)
• A fifth-order Markov model calculates the probability
of hexamers (six-base words) and can detect nucleotide
correlations in coding regions more accurately.
• An Interpolated Markov Model (IMM) is a variable-
length model that samples the largest number of
sequence patterns, with k ranging from 1 to 8.
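How a k-th order Markov model separates coding from non-coding sequence can be sketched as a log-odds score between two trained models. This is a minimal illustration under stated assumptions: the function names, the add-one pseudocount and the toy training strings are hypothetical, and real programs train fifth-order or interpolated models on large sets of known genes.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, k):
    """Count how often each base follows each k-base context."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1.0
    return counts


def log_prob(seq, counts, k, pseudocount=1.0):
    """Log-probability of seq (positions k onward) under the model.

    Assumption: add-one style smoothing so unseen contexts fall back
    to a uniform distribution instead of probability zero.
    """
    total = 0.0
    for i in range(k, len(seq)):
        row = counts.get(seq[i - k:i], {})
        numerator = row.get(seq[i], 0.0) + pseudocount
        denominator = sum(row.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def coding_log_odds(seq, coding_model, noncoding_model, k):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_prob(seq, coding_model, k) - log_prob(seq, noncoding_model, k)


# Toy training data (hypothetical), first-order models for brevity.
K = 1
coding_model = train_markov(["ATGGCCGCCGCCGCCTAA"], K)
noncoding_model = train_markov(["ATATATATATATATAT"], K)

print(coding_log_odds("GCCGCC", coding_model, noncoding_model, K))
print(coding_log_odds("ATATAT", coding_model, noncoding_model, K))
```

Setting `K = 5` gives the hexamer-based model mentioned above, but it needs far more training sequence, since there are 4^5 contexts to estimate.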
Gene content and length distribution of
prokaryotic genes
TYPICAL: Ranges from 100 to 500 amino acids, with a nucleotide distribution typical of the organism.
ATYPICAL: Shorter or longer, with different nucleotide statistics. Genes tend to escape detection when the typical gene model is used.
Gene finding programs in prokaryotes
• The programs are based on HMM/IMM.
o GeneMark.hmm (microbial genomes)
o Glimmer (UNIX program from TIGR). Computation
involves two steps, viz. model building & gene
prediction.
o FGENESB (bacterial sequences). It uses the Viterbi
algorithm & linear discriminant analysis (LDA).
o RBSfinder: searches for a ribosomal binding site or
Shine-Dalgarno sequence to predict the translation
initiation site.
Performance evaluation
• Evaluation of the accuracy of a prediction program.
• Confusion matrix:

                    Actual True     Actual False
Predicted True      TP              FP              PP = TP + FP
Predicted False     FN              TN              PN = FN + TN
                    AP = TP + FN    AN = FP + TN

TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

• Parameters:
o Sensitivity: $Sn = \frac{TP}{TP + FN}$
o Specificity: $Sp = \frac{TP}{TP + FP}$
o Misclassification rates: $\alpha = \frac{FN}{AP}$, $\beta = \frac{FP}{AN}$
o Normalized specificity: $\frac{1 - \alpha}{1 - \alpha + \beta}$
o Correlation Coefficient: A single parameter to describe
accuracy.
It provides an overall measure of accuracy, ranging from -1 to +1.
$CC = \frac{(TP \cdot TN) - (FN \cdot FP)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$
$F = \frac{2 \cdot Sn \cdot Sp}{Sn + Sp}$
$SMC = \frac{TP + TN}{TP + TN + FP + FN}$
$ACP = \frac{1}{n}\left[\frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FP} + \frac{TN}{TN + FN}\right]$, with n = 4 when all four ratios are defined,
$AC = 2(ACP - 0.5)$.
Sensitivity: Ability to include correct predictions. It is the
fraction of known genes correctly predicted.
Specificity: Ability to exclude incorrect predictions. It is the
fraction of predicted genes that correspond to true genes.
• Both are proportions of true signals.

SENSITIVITY   SPECIFICITY   RESULT
High          High          Accurate
High          Low           Overpredicts
Low           High          Conservative & lacks predictive power

These terms can be applied at the whole-gene,
whole-exon, or individual nucleotide level.
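The accuracy measures of this section can be computed directly from the four confusion-matrix counts. The function below is a minimal sketch: the function name and dictionary keys are illustrative, the symbols `alpha` and `beta` for the two misclassification rates are an assumed naming, n is taken as 4 in ACP, and all denominators are assumed to be non-zero.

```python
import math


def accuracy_measures(tp, fp, fn, tn):
    """Return the accuracy measures for one confusion matrix.

    Assumes every denominator below is non-zero.
    """
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    sn = tp / ap              # sensitivity
    sp = tp / (tp + fp)       # specificity, as defined in gene prediction
    alpha = fn / ap           # misclassification rate on actual positives
    beta = fp / an            # misclassification rate on actual negatives
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    acp = 0.25 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    )
    return {
        "Sn": sn,
        "Sp": sp,
        "normalized_Sp": (1 - alpha) / (1 - alpha + beta),
        "CC": cc,
        "F": 2 * sn * sp / (sn + sp),
        "SMC": (tp + tn) / (tp + tn + fp + fn),
        "ACP": acp,
        "AC": 2 * (acp - 0.5),
    }


# Hypothetical nucleotide-level counts, for illustration only.
for name, value in accuracy_measures(tp=80, fp=20, fn=20, tn=880).items():
    print(f"{name}: {value:.4f}")
```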
Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one
sequence (introns)
– Align genomic DNA with cDNA, EST or protein
• Score semi-conserved sequences at splice junctions
• Score coding constraints in translated exons.
Splice Site Detection
o Information Content: $I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2(f_{iB})$
o Extent of Splice Signal Window: $I_i \geq \bar{I} + 1.96\,\sigma_{\bar{I}}$
$i$: ith position in sequence
$f_{iB}$: frequency of base B at position i
$\bar{I}$: average information content over all positions
i > 20 from splice site
$\sigma_{\bar{I}}$: average standard deviation of $\bar{I}$
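The information content per position can be computed from a set of aligned splice-site sequences. The sketch below assumes equal-length RNA strings over U, C, A, G; the function names and the four toy sites are hypothetical, and the background mean and standard deviation are passed in as parameters rather than estimated from positions more than 20 bases from the splice site.

```python
import math
from collections import Counter


def information_content(sites, alphabet="UCAG"):
    """Return I_i = 2 + sum_B f_iB * log2(f_iB) for each aligned position."""
    info = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total = sum(counts[b] for b in alphabet)
        term = 0.0
        for b in alphabet:
            f = counts[b] / total
            if f > 0:  # 0 * log2(0) is taken as 0
                term += f * math.log2(f)
        info.append(2.0 + term)
    return info


def signal_positions(info, background_mean, background_sd):
    """Positions whose information content reaches mean + 1.96 * sd."""
    threshold = background_mean + 1.96 * background_sd
    return [i for i, value in enumerate(info) if value >= threshold]


# Toy alignment of four donor-site fragments (hypothetical data).
sites = ["GUAA", "GUAC", "GUGG", "GUCU"]
info = information_content(sites)
print(info)
print(signal_positions(info, background_mean=0.1, background_sd=0.1))
```

A fully conserved position scores 2 bits and a position with all four bases equally frequent scores 0 bits.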
Eukaryotic gene prediction
• Genomes are much larger than those of prokaryotes (10 Mbp to
670 Gbp).
• Low gene density.
• Space between genes is very large and rich in
repetitive sequences & transposable elements.
• Genes are split by intervening non-coding sequences
(introns), which are removed so that the coding sequences (exons) can be joined.
• Splice junctions follow the GT-AG rule.
• The 5’ splice junction of an intron has the consensus
motif GTAAGT and the 3’ junction has NCAG.
[Figure: exon 1 | GT (donor site) ... intron ... AG (acceptor site) | exon 2]
• Genes have a high density of CG dinucleotides near
the transcription start site. This region is called a CpG island. It
helps to identify the transcription initiation site of a
eukaryotic gene.
• Some post-transcriptional modifications occur before the
transcript becomes mature mRNA, viz. Capping,
Splicing and Polyadenylation.
o CAPPING: Occurs at the 5’ end of the transcript. It
involves methylation at the initial residue of the
RNA.
o SPLICING: Process of removal of introns and
joining of exons. It involves a large RNA-protein
complex called spliceosome.
o POLYADENYLATION: Addition of a stretch of As
(~250) at the 3’ end of the RNA. The process is
accomplished by poly-A polymerase.
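The CpG island signal mentioned in this section can be illustrated with the observed/expected CpG ratio of a sequence window. This is a minimal sketch: the function names are hypothetical, and the thresholds (GC fraction above 0.5, observed/expected above 0.6) are commonly used criteria that are assumed here, not taken from the slides. Real scans also require a minimum window length, typically a few hundred bases.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window."""
    window = window.upper()
    n = len(window)
    if n == 0:
        return 0.0, 0.0
    c = window.count("C")
    g = window.count("G")
    observed = window.count("CG")
    expected = c * g / n  # CpG count expected if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(window, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to one window."""
    gc_fraction, ratio = cpg_stats(window)
    return gc_fraction > min_gc and ratio > min_obs_exp


print(cpg_stats("CGCGCGCGCG"), looks_like_cpg_island("CGCGCGCGCG"))
print(cpg_stats("ATATATCCGG"), looks_like_cpg_island("ATATATCCGG"))
```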
Gene finding programs in eukaryotes
• Three categories of algorithms: ab initio-based, homology-based and consensus-based.
o Ab initio-based:
It predicts exons and joins them in the correct order, using two types of signal:
a) Gene signals: small patterns within the genomic
DNA, including putative splice sites, start and stop
sites of transcription or translation, branch points,
transcription factor binding sites and other recognizable
consensus sequences.
b) Gene content: statistical description of a region of genomic DNA, including
nucleotide and amino acid distribution, synonymous
codon usage and hexamer frequencies.
o Neural network-based algorithms
-Composed of a network of mathematical variables.
-Multiple layers: input, output and hidden layers.
-GRAIL (splice junctions, start and stop codons, poly-A
sites, promoters and CpG islands). It scans the query
sequence with windows of variable lengths & scores them.
o Discriminant analysis
-Linear Discriminant Analysis (LDA) plots a 2D
graph of coding signals vs. all possible 3’ splice site
positions and separates coding from non-coding signals with a diagonal line.
-Quadratic Discriminant Analysis (QDA) separates them with a
quadratic function, i.e. a curved line.
-FGENES (LDA)
-FGENESH [Find Genes] (HMMs)
-FGENESH_C (Similarity based)
-FGENESH+ (Combination of ab initio & similarity
based)
-MZEF [Michael Zhang’s Exon Finder] (QDA)
o HMMs
-GENSCAN (fifth-order HMMs); combines
hexamer frequencies with coding signals; exons with a probability
score P > 0.5 are considered reliable
-HMMgene (Conditional Maximum Likelihood);
combination of ab initio & homology-based algorithms
o Homology-based:
Exon structures and sequences of related species are
highly conserved.
Genes are predicted by comparison with homologous sequences derived from
cDNA or Expressed Sequence Tags (ESTs).
-GenomeScan (combines GENSCAN prediction
results with BLASTX similarity searches)
-EST2Genome (intron-exon boundaries); compares
an EST sequence with a genomic DNA sequence
-SGP-1 [Syntenic Gene Prediction] (similar to EST2Genome)
-TwinScan (gene-finding server; similar to
GenomeScan)
o Consensus-based:
Combines the results of multiple programs and keeps the
predictions they agree on.
Improves specificity by correcting false
positives & the problem of overprediction.
The cost is lowered sensitivity & missed predictions.
-GeneComber (combines HMMgene &
GenScan prediction results)
-DIGIT (combines FGENESH, GENSCAN &
HMMgene)
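The consensus idea can be sketched as a simple vote over exon coordinates. This is an illustration only: the function name, the `min_votes` parameter, the program labels and the coordinates are hypothetical, and exact coordinate matching is a simplifying assumption. It does not reproduce how GeneComber or DIGIT actually combine their inputs.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=2):
    """Keep exons reported by at least `min_votes` programs.

    `predictions` maps a program label to a list of (start, end) exons.
    Assumption: two exons agree only if both coordinates are identical.
    """
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))  # one vote per program per exon
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


# Hypothetical output of three programs on the same sequence.
predictions = {
    "program_A": [(100, 200), (300, 400)],
    "program_B": [(100, 200), (500, 600)],
    "program_C": [(100, 200), (300, 400), (700, 800)],
}

print(consensus_exons(predictions, min_votes=2))
print(consensus_exons(predictions, min_votes=3))
```

Raising `min_votes` removes more false positives but also discards true exons found by only one program, which is the specificity versus sensitivity trade-off described above.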
Performance evaluation
• Sensitivity & specificity should be defined at the
levels of nucleotides, exons and entire genes.
• At exon level,
$CC = \frac{Sn + Sp}{2}$
ME = proportion of missed exons & missed genes
WE = proportion of wrongly predicted exons & wrong
genes
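The exon-level measures can be sketched as follows. The definitions used here are assumptions that the slides do not state explicitly: an exon counts as correct only when both boundaries match, a missed exon is an actual exon overlapping no predicted exon, a wrong exon is a predicted exon overlapping no actual exon, and coordinates are half-open (start, end) pairs. The function name and example coordinates are hypothetical.

```python
def exon_level_measures(actual, predicted):
    """Return exon-level Sn, Sp, their mean, ME and WE."""
    actual, predicted = set(actual), set(predicted)

    def overlaps(a, b):
        # Half-open intervals overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    correct = actual & predicted  # both boundaries identical
    missed = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]

    sn = len(correct) / len(actual)
    sp = len(correct) / len(predicted)
    return {
        "Sn": sn,
        "Sp": sp,
        "mean": (sn + sp) / 2,
        "ME": len(missed) / len(actual),
        "WE": len(wrong) / len(predicted),
    }


# Hypothetical annotation and prediction for one gene region.
actual = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (900, 950)]
print(exon_level_measures(actual, predicted))
```

In this example the exon (300, 410) overlaps a real exon but has a wrong boundary, so it is neither correct nor counted in WE.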
CONCLUSION
• The computational prediction of genes is one of the most
important processes in genome & sequence analysis.
• Prediction is easier for prokaryotes because
their genes are not interrupted. HMM-based predictions
provide the best accuracy.
• Current algorithms are categorized as ab initio,
homology-based & consensus-based. Combining statistical &
homology information improves the
performance of gene finding.
• With the advancement of computational techniques,
the gene prediction process will become more
feasible.
THANK YOU
